ABSTRACT

We study in this paper provenance information for queries 
with aggregation. Provenance information was studied in the 
context of various query languages that do not allow for ag- 
gregation, and recent work has suggested to capture prove- 
nance by annotating the different database tuples with ele- 
ments of a commutative semiring and propagating the anno- 
tations through query evaluation. We show that aggregate 
queries pose novel challenges rendering this approach inap- 
plicable. Consequently, we propose a new approach, where 
we annotate with provenance information not just tuples but 
also the individual values within tuples, using provenance to
describe how those values are computed. We realize this approach
in a concrete construction, first for "simple" queries where 
the aggregation operator is the last one applied, and then for 
arbitrary (positive) relational algebra queries with aggrega- 
tion; the latter queries are shown to be more challenging in 
this context. Finally, we use aggregation to encode queries 
with difference, and study the semantics obtained for such 
queries on provenance annotated databases. 

1. INTRODUCTION 

The annotation of the results of database transformations 
with provenance information has quite a few applications. Recent
work [24, 17, 21] has proposed a framework of semiring annotations
that allows us to state formally what is expected of
such provenance information. These papers have developed 
the framework for the positive fragment of the relational 
algebra (as well as for Datalog, the positive Nested Rela- 
tional Calculus, and some query languages on trees/XML). 
The main goal of this paper is to extend the framework to 
aggregate operations. 

In the perspective promoted by these papers, provenance 
is a general form of annotation information that can be spe- 
cialized for different purposes, such as multiplicity, trust, 
cost, security, or identification of "possible worlds" which 
in turn applies to incomplete databases, deletion propaga- 
tion, and probabilistic databases. In fact, the introduction 
of the framework in [24] was motivated by the need to track 
trust and deletion propagation in the Orchestra system [33].
What makes such a diversity of applications possible is that 
each is captured by a different semiring, while provenance is 
represented by elements of a semiring of polynomials. One 
then relies on the property that any semiring-annotation 
semantics factors through the provenance polynomials se- 
mantics. This means that storing provenance polynomials 
allows for many other practical applications. For example,
to capture access control, where access to different tuples
requires different security credentials, we can simply evaluate
the polynomials in the security semiring and propagate
the security annotations through query evaluation (see Section [53}),
assigning security levels to query results.

Figure 1: Projection on annotated relations

Let us briefly illustrate deletion propagation as an application
of provenance. Consider a simple example of an
employee/department/salary relation R shown in Figure 1(a).

The variables $p_1, p_2, p_3, r_1, r_2$ can be thought of as tuple
identifiers, and in the framework of provenance polynomials [24]
they are the "provenance tokens" or "indeterminates"
out of which provenance is built. We denote by N[X] the
set of provenance polynomials (here $X = \{p_1, p_2, p_3, r_1, r_2\}$).
R can be seen as an N[X]-annotated relation; as defined in
[24], the evaluation of a query, for example $\Pi_{Dept} R$, produces
another N[X]-annotated relation, in this example the one
shown in Figure 1(b). Intuitively, in this simple example,
the summation in the annotation of every result tuple is
over the identifiers of its alternative origins.¹

Now, the result of propagating the deletions of tuples with
Empid 3 and 5 in R is obtained by simply setting $p_3 = r_2 = 0$
in the answer. We get the same two tuples in the
query answer but their provenances change to $p_1 + p_2$ and
$r_1$, respectively. If the tuple with Empid 4 is also deleted
from R then we also set $r_1 = 0$, and the second tuple in the
answer is deleted because its provenance has now become
0. This algebraic treatment of deletions is related to the
counting algorithm for view maintenance [26], but is more
general as it incrementally maintains not just the data but
also the provenance.
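The deletion example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the assignment of Empids to departments d1 and d2 is an assumed reading of Figure 1(a), and a sum of tokens is represented simply as a sorted list of its summands.

```python
# Hypothetical data: the text gives only the tokens and Empids, so the
# mapping Empid -> (Dept, token) below is an assumed reading of Figure 1(a).
R = {1: ("d1", "p1"), 2: ("d1", "p2"), 3: ("d1", "p3"),
     4: ("d2", "r1"), 5: ("d2", "r2")}

def project_dept(relation):
    """Project on Dept; each result tuple is annotated with the sum
    (here: a sorted list of summands) of its alternative origins."""
    out = {}
    for dept, token in relation.values():
        out.setdefault(dept, []).append(token)
    return {d: sorted(ts) for d, ts in out.items()}

def delete(annotated, deleted_tokens):
    """Set the deleted tokens to 0: drop them from every sum and drop
    tuples whose provenance becomes 0 (the empty sum)."""
    out = {}
    for dept, tokens in annotated.items():
        kept = [t for t in tokens if t not in deleted_tokens]
        if kept:
            out[dept] = kept
    return out

answer = project_dept(R)
after_3_5 = delete(answer, {"p3", "r2"})          # Empid 3 and 5 deleted
after_3_4_5 = delete(answer, {"p3", "r2", "r1"})  # Empid 4 deleted as well
```

Note that the deletions are applied to the stored answer, not to R: the query is never re-evaluated.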

An intuitive way of understanding what happens is that 
provenance-aware evaluation of queries conveniently "commutes"
with deletions. In fact, in [24, 17] this intuition is
captured formally by theorems that state that query eval- 
uation commutes with semiring homomorphisms. The fac- 
torization through provenance relies on this and on the fact 
that the polynomial provenance semiring is "freely gener- 
ated". All applications of provenance polynomials we have 
listed, for trust, security, etc., are based on these theorems. 



1 We explain how the annotations of query results are computed
in Section 2.1.



[Tables for panels (a) and (b): columns Dept, SalMass, and tuple annotation]

Figure 2: A naive approach to aggregation

Thus, commutation with homomorphisms is an essential 
criterion for our proposed framework extension to aggregate 
operations. However, in Section 3.1 we prove that the framework
of semiring-annotated relations introduced in [24] cannot
be extended to handle aggregation while both satisfying
commutation with homomorphisms and working as usual on 
set or bag relations. 

If the semiring operations are not enough then perhaps
we can add others? This is a natural idea, so we illustrate it
on the same R in Figure 1(a), and we again use the need
to support deletion propagation to guide the approach.
Consider the query that groups by Dept and sums Sal. The
result of the summation depends on which tuples participate
in it. To provide enough information to obtain all the
possible summation results for all possible sets of deletions,
we could use the representation in Figure 2(a), where we add
to the semiring operations a unary operation $\hat{\ }$ with the
property that $\hat{p} = 1$ whenever $p = 0$. This will indeed satisfy
the deletion criterion. For example, when the tuple with
Id 3 is deleted we get the relation in Figure 2(b). In fact,
there exist semirings with the additional structure needed to
define $\hat{\ }$. For example, in the semiring of polynomials with
integer coefficients, Z[X], we can take $\hat{p} = 1 - p$, while in
the semiring of boolean expressions with variables from X,
BoolExp(X), we can take $\hat{p} = \neg p$.² The latter is essentially
the approach taken in [31]. However, whether we use Z[X]
or BoolExp(X), we still have, in the worst case, exponen- 
tially many different results to account for, at least in the 
case of summation (a lower bound recognized also in [31]). It 
follows that summation in particular (and therefore any uni- 
form approach to aggregation) cannot be represented with 
a feasible amount of annotation as long as the annotation 
stays at the tuple level. 

Instead, we will present a provenance representation for 
aggregation results that leads only to a poly-size increase, 
one that we believe is tractably implementable using meth- 
ods similar to the ones used in Orchestra [23]. We achieve 
this via a more radical approach: we annotate with prove- 
nance information not just the tuples of the answer but also 
the manner in which the values in those tuples are computed. 

We can gain intuition towards our representation from the 
particular case of bags, which are in fact N-relations, i.e., 
relations whose tuples are annotated with elements of the 
semiring $(\mathbb{N}, +, \cdot, 0, 1)$. Assume that R in Figure 1(a) is such
a relation, i.e., $p_1, \ldots, r_2 \in \mathbb{N}$ are tuple multiplicities. Then,
after sum-aggregation the value of the attribute SalMass in
the tuple with Dept $d_1$ is computed by $p_1 \times 20 + p_2 \times 10 + p_3 \times 15$.
Now, if the multiplicities are, for example, $p_1 = 2, p_2 = 3, p_3 = 1$,
then the aggregate value is 85. But what
if R is a relation annotated with provenance polynomials
rather than multiplicities? Then, the aggregate value does
not correspond to any number.

We will make $p_1 \times 20$ into an abstract construction, $p_1 \otimes 20$,
and the aggregate value will be the formal expression
$p_1 \otimes 20 + p_2 \otimes 10 + p_3 \otimes 15$.

Intuitively, we are embedding the domain of sum-aggregates, 
i.e., the reals R, into a larger domain of formal expres- 
sions that capture how the sum-aggregates are computed 
from values annotated with provenance. We do the same 
for other kinds of aggregation; for instance, min-aggregation
gives $p_1 \otimes 20 \,\min\, p_2 \otimes 10 \,\min\, p_3 \otimes 15$. We call these annotated
aggregate expressions.
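As a rough illustration of annotated aggregate expressions, the following Python sketch keeps the expression as a list of (token, value) pairs and evaluates it only once the tokens receive concrete values. The function names are hypothetical, and the two interpretations shown (multiplicities for SUM, presence flags for MIN) are just the bag and set cases discussed above.

```python
# A formal aggregate expression is kept as a list of simple tensors
# (token, value); the tokens and salaries follow the running example.
expr = [("p1", 20), ("p2", 10), ("p3", 15)]

def eval_sum_in_bags(tensors, multiplicity):
    """Interpret tokens as multiplicities in N and aggregate with SUM."""
    return sum(multiplicity[token] * value for token, value in tensors)

def eval_min_in_sets(tensors, present):
    """Interpret tokens as booleans (tuple present or absent) and aggregate
    with MIN; the identity of MIN, +infinity, is returned when no tuple
    is present."""
    return min((value for token, value in tensors if present[token]),
               default=float("inf"))

bag_value = eval_sum_in_bags(expr, {"p1": 2, "p2": 3, "p3": 1})
set_value = eval_min_in_sets(expr, {"p1": True, "p2": False, "p3": True})
```

With the multiplicities used in the text the SUM interpretation gives 85, matching the value computed there.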

In this paper we consider only aggregations defined by
commutative-associative operations.³ Specifically, our framework
can accommodate aggregation based on any commutative
monoid. For example, the commutative monoid for summation
is $(\mathbb{R}, +, 0)$ while the one for min is $(\mathbb{R}^{\infty}, \min, \infty)$.⁴

To combine an aggregation monoid M with an annotation 
commutative semiring K, in a way capturing aggregates over
K-relations, we propose the use of the algebraic structure of
K-semimodules (see Section 2.2). Semimodules are a way of
generalizing (a lot!) the operations considered in linear algebra.
The "vectors" of a semimodule form only a commutative monoid (rather
than an abelian group) and its "scalars" are the elements
of K, which is only a commutative semiring (rather than a
field).

In general, a commutative monoid M does not have an 
obvious structure of K-semimodule. To make it such we
may need to add new elements corresponding to the scalar
multiplication of elements of M with elements from K, thus
ending up with the formal expressions that represent aggregate
computations, motivated above, as elements of a tensor
product construction $K \otimes M$. We show that the use of tensor
product expressions as a formal representation of aggrega- 
tion result is effective in managing the provenance of "simple 
aggregate" queries, namely queries where the aggregation 
operators are the last ones applied. 

We show that certain semirings are "compatible" with 
certain monoids, in the sense that the results of computation
done in $K \otimes M$ may be mapped to M, faithfully
representing the aggregation results. Interestingly, compat- 
ibility is aligned with common wisdom: it is known that 
some (idempotent) aggregation functions such as MIN and 
MAX work well for set relations, while SUM and PROD require
the treatment of relations as bags. We show that non-idempotent
monoids are compatible only with "bag" semirings,
i.e., semirings from which there exists a homomorphism
to N.

In general, aggregation results may be used by the query 
as the input to further operators, such as value-based joins 
or selections. Here the formal representation of values leads 
to a difficulty: the truth values of comparison operators on
such formal expressions are undetermined! Consequently, we
extend our framework and construct semirings in which formal
comparison expressions between elements of the corresponding
semimodule are elements of the semiring. This
means that an expression like $[p_1 \otimes 20 = p_2 \otimes 10 + p_3 \otimes 15]$
may be used as part of the provenance of a tuple in the join
result. This expression is simply treated as a new provenance
token (with constraints), until $p_1, p_2, p_3$ are assigned
values, e.g., from B or N, in which case we can interpret



2 Both of these also give natural semantics to relational difference.
Z[X] is used in [5D] following the use of Z in [22],
while BoolExp(X) is used in the seminal paper [28].



3 As shown in [32], for list collections it also makes sense to 
consider non-commutative aggregations. 
4 K-annotated relations with union (see Section 2.1) also
form such a structure.



both sides of the equality as elements of the monoid and determine
the truth value of the equality (see Section 4). We
show in Section 3 that this construction allows us to manage
provenance information for arbitrary queries with aggregation,
while keeping the representation size polynomial in the
size of the input database. Our construction is robust: if further
queries are applied, the token $[p_1 \otimes 20 = p_2 \otimes 10 + p_3 \otimes 15]$
can be used as part of a more complex expression, just as
any other provenance token.
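A small Python sketch may help to see how such a comparison token stays symbolic until the tokens are valued. It assumes the bag case (tokens valued in N, SUM aggregation); the helper names are hypothetical and the sketch does not model the constraints attached to the token.

```python
def sum_value(tensors, multiplicity):
    """Value of a SUM aggregate expression once tokens are multiplicities."""
    return sum(multiplicity[t] * v for t, v in tensors)

def eval_equality_token(lhs, rhs, multiplicity):
    """Resolve the token [lhs = rhs] once every provenance token has a
    value in N: both sides become monoid elements and are compared."""
    return sum_value(lhs, multiplicity) == sum_value(rhs, multiplicity)

# The token [p1 ⊗ 20 = p2 ⊗ 10 + p3 ⊗ 15] from the text.
lhs = [("p1", 20)]
rhs = [("p2", 10), ("p3", 15)]

holds = eval_equality_token(lhs, rhs, {"p1": 2, "p2": 1, "p3": 2})
fails = eval_equality_token(lhs, rhs, {"p1": 1, "p2": 1, "p3": 1})
```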

The main result of this paper is providing, for the first 
time, a semantics for aggregation (including group by) on 
semiring-annotated relations that: 

• Coincides with the usual semantics on set/bag rela- 
tions for min/max/sum/prod. 

• Commutes with homomorphisms of semirings (hence 
all the ensuing applications) 

• Is representable with only poly-size overhead. 

A second result of this paper is a new semantics for difference 
on relations annotated with elements from any commutative 
semiring. This is done via an encoding of relational difference
using nested aggregation. The fact that such an encoding
can be done is known (see, e.g., [29, 9]), but combined
with our provenance framework, the encoding gives a semantics
for "provenance-aware" difference. Our new semantics
for R — S is a hybrid of bag-style and set-style semantics, in 
the sense that tuples of R that appear in S do not appear in 
R — S (i.e. a boolean negative condition is used), while those 
that do not appear in S appear in R — S with the same an- 
notation (multiplicity, if K = N is used) that they had in R. 
This makes the semantics different from the bag-monus se- 
mantics and its generalization to "monus-semirings" in [19] 
as well as from the "negative multiplicities" semantics in [22] 
(more discussion in Section 6). We examine the equational
laws entailed by this new semantics, in contrast to those of 
previously proposed semantics for difference. In our opinion, 
this semantics is probably not the last word on difference of 
annotated relations, but we hope that it will help inform 
and calibrate future work on the topic. 
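For the special case K = N, the hybrid semantics described above can be contrasted with bag-monus in a short Python sketch. Both functions are illustrative only: they operate on plain multiplicity dictionaries and do not go through the nested-aggregation encoding used in the paper.

```python
def hybrid_difference(r, s):
    """r, s: dicts mapping tuples to multiplicities in N (bags).
    A tuple of r survives only if it does not appear in s, and then it
    keeps the multiplicity it had in r."""
    return {t: k for t, k in r.items() if k > 0 and s.get(t, 0) == 0}

def bag_monus(r, s):
    """Bag-monus for contrast: multiplicities are subtracted, truncating at 0."""
    out = {t: k - s.get(t, 0) for t, k in r.items()}
    return {t: k for t, k in out.items() if k > 0}

r = {"a": 3, "b": 2}
s = {"a": 1}
hybrid = hybrid_difference(r, s)   # "a" appears in s, so it is removed entirely
monus = bag_monus(r, s)            # "a" keeps multiplicity 3 - 1
```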

Paper Organization. The rest of the paper is organized as
follows. Section 2 describes and exemplifies the main mathematical
ingredients used throughout the paper. Section 3
describes our proposed framework for "simple" aggregation
queries, and this framework is extended in Section 4 to
nested aggregation queries. We consider difference queries
in Section 5. Related work is discussed in Section 6, and we
conclude in Section 7.

2. PRELIMINARIES 

We provide in this section the algebraic foundations that 
will be used throughout the paper. We start by recalling the 
notion of semiring and its use in [21] to capture provenance 
for the SPJU algebra queries. We then consider aggregates,
and show the new algebraic construction that is required to
support them accurately.

2.1 Semirings and SPJU 

We briefly review the basic framework introduced in [24].
A commutative monoid is an algebraic structure $(M, +_M, 0_M)$
where $+_M$ is an associative and commutative binary operation
and $0_M$ is an identity for $+_M$. A monoid homomorphism
is a mapping $h : M \to M'$ where M, M' are
monoids, and $h(0_M) = 0_{M'}$, $h(a + b) = h(a) + h(b)$. We
will consider database operations on relations whose tuples
are annotated with elements from commutative semirings.
These are structures $(K, +_K, \cdot_K, 0_K, 1_K)$ where $(K, +_K, 0_K)$
and $(K, \cdot_K, 1_K)$ are commutative monoids, $\cdot_K$ is distributive
over $+_K$, and $a \cdot_K 0_K = 0_K \cdot_K a = 0_K$. A semiring
homomorphism is a mapping $h : K \to K'$ where K, K'
are semirings, and $h(0_K) = 0_{K'}$, $h(1_K) = 1_{K'}$, $h(a + b) =
h(a) + h(b)$, $h(a \cdot b) = h(a) \cdot h(b)$. Examples of commutative
semirings are any commutative ring (of course) but also
any distributive lattice, hence any boolean algebra. Examples
of particular interest to us include the boolean semiring
$(\mathbb{B}, \vee, \wedge, \bot, \top)$ (for usual set semantics), the natural numbers
semiring $(\mathbb{N}, +, \cdot, 0, 1)$ (its elements are multiplicities,
i.e., annotations that give bag semantics), and the security
semiring $(\mathbb{S}, \min, \max, 0_S, 1_S)$ where S is the ordered set
$1_S < C < S < T < 0_S$, whose elements have the following
meaning when used as annotations: $1_S$ : public ("always
available"), C : confidential, S : secret, T : top secret, and
$0_S$ means "never available".
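The three example semirings can be written down concretely as follows. This Python sketch is an illustration only: the `Semiring` record and the integer encoding of the security levels are assumptions made here for convenience, chosen so that the integer order matches $1_S < C < S < T < 0_S$.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    add: Callable[[Any, Any], Any]   # alternative use of data
    mul: Callable[[Any, Any], Any]   # joint use of data
    zero: Any
    one: Any

BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)

# Security levels encoded as integers so that 1_S < C < S < T < 0_S.
PUBLIC, CONFIDENTIAL, SECRET, TOP_SECRET, NEVER = range(5)
SECURITY = Semiring(min, max, NEVER, PUBLIC)

# Joint use of a confidential and a secret tuple requires secret clearance;
# having either one as an alternative requires only confidential clearance.
joint = SECURITY.mul(CONFIDENTIAL, SECRET)
alternative = SECURITY.add(CONFIDENTIAL, SECRET)
```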

Certain semirings play an essential role in capturing provenance
information. Given a set X of provenance tokens
which correspond to "atomic" provenance information, e.g.,
tuple identifiers, the semiring of polynomials $(\mathbb{N}[X], +, \cdot, 0, 1)$
was shown in [24] to adequately, and most generally, capture
provenance for positive relational queries. The provenance
interpretation of the semiring structure is the following. The
+ operation on annotations corresponds to alternative use
of data, the $\cdot$ operation to joint use of data, 1 annotates data
that is always and unrestrictedly available, and 0 annotates
absent data. The definition of the K-relational algebra (see
below for union, projection and join) indeed fits this interpretation.
Algebraically, N[X] is the commutative semiring
freely generated by X, i.e., for any other commutative
semiring K, any valuation of the provenance tokens $X \to K$
extends uniquely to a semiring homomorphism $\mathbb{N}[X] \to K$
(an evaluation in K of the polynomials). We say that any
semiring annotation semantics factors through the provenance
polynomials semantics, which means that for practical
purposes storing provenance information suffices for many
other applications too. Other semirings can also be used
to capture certain forms of provenance, albeit less generally
than N[X] [24, 21]. For example, boolean expressions capture
enough provenance to serve in the intensional semantics
of queries on incomplete [28] and probabilistic data [18, 40].
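The extension of a valuation to a homomorphism N[X] → K can be sketched in Python as below. The representation of a polynomial as a dictionary from monomials to natural-number coefficients, and the function name `evaluate`, are choices made for this illustration rather than anything prescribed by the framework.

```python
def evaluate(poly, valuation, add, mul, zero, one):
    """poly: dict mapping a monomial (tuple of tokens, with repetition for
    powers) to its coefficient in N. The valuation is extended to the
    homomorphism N[X] -> K determined by (add, mul, zero, one)."""
    total = zero
    for monomial, coeff in poly.items():
        term = one
        for token in monomial:
            term = mul(term, valuation[token])
        for _ in range(coeff):          # coefficient n means an n-fold sum
            total = add(total, term)
    return total

# p = 2xy + x
p = {("x", "y"): 2, ("x",): 1}
in_bags = evaluate(p, {"x": 2, "y": 3},
                   lambda a, b: a + b, lambda a, b: a * b, 0, 1)
in_sets = evaluate(p, {"x": True, "y": False},
                   lambda a, b: a or b, lambda a, b: a and b, False, True)
```

The same stored polynomial is thus reused for bag semantics and for set semantics, which is the practical content of "factoring through" provenance polynomials.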

To define annotated relations we use here the named perspective
of the relational model. Fix a countably infinite
domain D of values (constants). For any finite set U of attributes,
a tuple is a function $t : U \to D$ and we denote the
set of all such possible tuples by $D^U$. Given a commutative
semiring K, a K-relation (with schema U) is a function
$R : D^U \to K$ whose support, $supp(R) = \{t \mid R(t) \neq 0_K\}$, is
finite. For a fixed set of attributes U we denote by K-Rel
(when U is clear from the context) the set of K-relations
with schema U. We also define a K-set to be a function
$S : D \to K$, again of finite support. We then define:

Union If $R_i : D^U \to K$, $i = 1, 2$, then $R_1 \cup_K R_2 : D^U \to K$
is defined by $(R_1 \cup_K R_2)(t) = R_1(t) +_K R_2(t)$. The
definition of union of K-sets follows similarly.

We also define the empty K-relation (K-set) by $\emptyset_K(t) = 0_K$.
It is easy to see that (K-Rel, $\cup_K$, $\emptyset_K$) is a commutative
monoid.⁵ Similarly, we get the commutative monoid of K-sets
(K-Set, $\cup_K$, $\emptyset_K$).

Given a named relational schema, K-databases are defined
from K-relations just as relational databases are defined

5 In fact, it also has a semiring structure. 



from usual relations, and in fact the usual (set semantics) 
databases correspond to the particular case K = B. The 
(positive) K-relational algebra defined in [24] corresponds to
a semantics on K-databases for the usual operations of the
relational algebra. We have already defined the semantics
of union above and we give here just two other cases, leaving
the rest for the Appendix (for a tuple t and an attribute set
U', $t|_{U'}$ is the restriction of t to U'):

Projection If $R : D^U \to K$ and $U' \subseteq U$ then $\Pi_{U'} R : D^{U'} \to K$
is defined by $(\Pi_{U'} R)(t) = \sum R(t')$, where
the $+_K$ sum is over all $t' \in supp(R)$ such that $t'|_{U'} = t$.

Natural Join If $R_i : D^{U_i} \to K$, $i = 1, 2$, then $R_1 \bowtie R_2 : D^{U_1 \cup U_2} \to K$
is defined by $(R_1 \bowtie R_2)(t) = R_1(t_1) \cdot_K R_2(t_2)$,
where $t_i = t|_{U_i}$, $i = 1, 2$.
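The three operations defined above can be sketched in Python for the special case K = N. This is a minimal illustration, assuming a K-relation is stored as a dictionary from tuples to annotations that lists only its support; the helper `tup` and the attribute names in the example are hypothetical.

```python
def tup(**attrs):
    """A tuple in the named perspective: a hashable set of (attribute, value) pairs."""
    return frozenset(attrs.items())

def union(r1, r2):
    """Annotations of the same tuple are added."""
    out = dict(r1)
    for t, k in r2.items():
        out[t] = out.get(t, 0) + k
    return out

def project(r, attrs):
    """Annotations of all tuples with the same restriction are added."""
    out = {}
    for t, k in r.items():
        restricted = frozenset((a, v) for a, v in t if a in attrs)
        out[restricted] = out.get(restricted, 0) + k
    return out

def join(r1, r2):
    """Annotations of the two joined tuples are multiplied."""
    out = {}
    for t1, k1 in r1.items():
        for t2, k2 in r2.items():
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                t = frozenset({**d1, **d2}.items())
                out[t] = out.get(t, 0) + k1 * k2
    return out

emp = {tup(emp=1, dept="d1"): 2, tup(emp=2, dept="d1"): 3}
loc = {tup(dept="d1", floor=4): 2}
by_dept = project(emp, {"dept"})
joined = join(emp, loc)
```

Replacing `+`, `*`, and `0` by the operations of another commutative semiring would give the same operations for that semiring.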

2.2 Semimodules and aggregates 

We will consider aggregates defined by commutative monoids.
Some examples are SUM = $(\mathbb{R}, +, 0)$ for summation,⁶ MIN =
$(\mathbb{R}^{\pm\infty}, \min, +\infty)$ for min, MAX = $(\mathbb{R}^{\pm\infty}, \max, -\infty)$ for max,
and PROD = $(\mathbb{R}, \times, 1)$ for product.

In dealing with aggregates we have to extend the oper- 
ation of a commutative monoid to operations on relations 
annotated with elements of semirings. This interaction will 
be captured by semimodules. 

Definition 2.1. Given a commutative semiring K, a structure
$(W, +_W, 0_W, *_W)$ is a K-semimodule if $(W, +_W, 0_W)$ is a
commutative monoid and $*_W$ is a binary operation $K \times W \to W$
such that (for all $k, k_1, k_2 \in K$ and $w, w_1, w_2 \in W$):

$$
\begin{aligned}
k *_W (w_1 +_W w_2) &= k *_W w_1 +_W k *_W w_2 && (1)\\
k *_W 0_W &= 0_W && (2)\\
(k_1 +_K k_2) *_W w &= k_1 *_W w +_W k_2 *_W w && (3)\\
0_K *_W w &= 0_W && (4)\\
(k_1 \cdot_K k_2) *_W w &= k_1 *_W (k_2 *_W w) && (5)\\
1_K *_W w &= w && (6)
\end{aligned}
$$

In any (commutative) monoid $(M, +_M, 0_M)$ define for any
$n \in \mathbb{N}$ and $x \in M$

$$ nx = x +_M \cdots +_M x \quad (n \text{ times}) $$

in particular $0x = 0_M$. Thus M has a canonical structure
of N-semimodule. Moreover, it is easy to check that
a commutative monoid M is a B-semimodule if and only if
its operation is idempotent: $x +_M x = x$. The K-relations
themselves form a K-semimodule (K-Rel, $\cup_K$, $\emptyset_K$, $*_K$) where
$(k *_K R)(t) = k \cdot_K R(t)$.⁷
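The idempotence condition can be checked mechanically on small samples. In the Python sketch below, the boolean action is the one forced by laws (4) and (6) (False sends everything to the identity, True leaves the element unchanged), and only law (3) is tested; the function names and sample values are assumptions of this illustration.

```python
def b_action(k, w, zero):
    """Candidate scalar action of the boolean semiring: True keeps w,
    False gives the monoid identity."""
    return w if k else zero

def satisfies_axiom_3(plus, zero, samples):
    """Check (k1 + k2) * w = k1 * w + k2 * w for all boolean k1, k2,
    where + on scalars is logical or."""
    return all(
        b_action(k1 or k2, w, zero)
        == plus(b_action(k1, w, zero), b_action(k2, w, zero))
        for k1 in (False, True) for k2 in (False, True) for w in samples
    )

samples = [1, 2, 5]
min_ok = satisfies_axiom_3(min, float("inf"), samples)            # MIN is idempotent
sum_ok = satisfies_axiom_3(lambda a, b: a + b, 0, samples)        # SUM is not
```

For SUM the law fails at k1 = k2 = True, where it would require w = w + w.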

We now show, for any K-semimodule W, how to define
W-aggregation of a K-set of elements from W. We assume
that $W \subseteq D$ and that we have just one attribute, whose
values are all from W. Consider the K-set S such that
$supp(S) = \{w_1, \ldots, w_n\}$ and $S(w_i) = k_i \in K$, $i = 1, \ldots, n$
(i.e., each $w_i$ is annotated with $k_i$). Then, the result of
W-aggregating S is defined as

$$ SetAgg_W(S) = k_1 *_W w_1 +_W \cdots +_W k_n *_W w_n \in W $$

For the empty K-set we define $SetAgg_W(\emptyset_K) = 0_W$. Clearly,
$SetAgg_W$ is a semimodule homomorphism.⁸ Since all commutative
monoids are N-semimodules this gives the usual

6 COUNT is a particular case of summation and AVG is obtained
from summation and COUNT.
7 In fact, it is the K-semimodule freely generated by $D^U$.
8 In fact, it is the free homomorphism determined by the
identity function on W.



sum, prod, min, and max aggregations on bags. Since MIN 
and MAX are B-semimodules this gives the usual min and 
max aggregation on sets.⁹
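The definition of SetAgg can be sketched directly in Python. The K-set is assumed to be a dictionary from values to annotations, and the semimodule is passed in as its addition, identity, and scalar action; these parameter names are hypothetical.

```python
from functools import reduce

def set_agg(k_set, plus, zero, act):
    """k_set: dict mapping values w_i to annotations k_i.
    Returns k_1 * w_1 + ... + k_n * w_n, or the monoid identity if empty."""
    return reduce(plus, (act(k, w) for w, k in k_set.items()), zero)

def nat_times(n, w):
    """Canonical N-action on SUM: the n-fold sum of w."""
    return n * w

def bool_keep_min(k, w):
    """B-action on MIN: absent values contribute +infinity."""
    return w if k else float("inf")

bag = {20: 2, 10: 3, 15: 1}
sum_result = set_agg(bag, lambda a, b: a + b, 0, nat_times)
min_result = set_agg({20: True, 10: False, 15: True},
                     min, float("inf"), bool_keep_min)
empty_result = set_agg({}, lambda a, b: a + b, 0, nat_times)
```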

Note that SetAgg is an operation on sets, not an operation 
on relations. In the sequel we show how to extend it to one. 

2.3 A tensor product construction 

More generally, we want to investigate M-aggregation on
K-relations where M is a commutative monoid and K is
some commutative semiring. Since M may not have enough
elements to represent K-annotated aggregations, we construct
a K-semimodule in which M can be embedded, by transferring
to semimodules the basic idea behind a standard algebraic
construction, as follows.

Let K be any commutative semiring and M be any commutative
monoid. We start with $K \times M$, denote its elements
$k \otimes m$ instead of $(k, m)$ and call them "simple tensors". Next
we consider (finite) bags of such simple tensors, which, with
bag union and the empty bag, form a commutative monoid.
It will be convenient to denote bag union by $+_{K \otimes M}$, the
empty bag by $0_{K \otimes M}$, and to abuse notation by denoting singleton
bags by the unique element they contain. Then, every
non-empty bag of simple tensors can be written (repeating
summands by multiplicity) $k_1 \otimes m_1 +_{K \otimes M} \cdots +_{K \otimes M} k_n \otimes m_n$.
Now we define

$$ k *_{K \otimes M} \sum_i k_i \otimes m_i = \sum_i (k \cdot_K k_i) \otimes m_i $$

Let ~ be the smallest congruence w.r.t. $+_{K \otimes M}$ and $*_{K \otimes M}$
that satisfies (for all $k, k', m, m'$):

$$
\begin{aligned}
(k +_K k') \otimes m &\sim k \otimes m +_{K \otimes M} k' \otimes m\\
0_K \otimes m &\sim 0_{K \otimes M}\\
k \otimes (m +_M m') &\sim k \otimes m +_{K \otimes M} k \otimes m'\\
k \otimes 0_M &\sim 0_{K \otimes M}
\end{aligned}
$$

We denote by $K \otimes M$ the set of tensors, i.e., equivalence
classes of bags of simple tensors modulo ~. We show in
the Appendix that $K \otimes M$ forms a K-semimodule.
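To make the bag-of-simple-tensors view concrete, here is a small Python sketch for the special case K = N. It applies only three of the four congruence rules (merging tensors with the same monoid element and dropping the two kinds of zero tensors), so it is a simplification step rather than a canonical form; the function names are hypothetical.

```python
def normalize(bag, zero_m=0):
    """bag: list of simple tensors (k, m) with k in N and m in a monoid
    whose identity is zero_m. Merge tensors with equal m using
    (k + k') ⊗ m ~ k ⊗ m + k' ⊗ m, and drop 0_K ⊗ m and k ⊗ 0_M."""
    merged = {}
    for k, m in bag:
        if k != 0 and m != zero_m:
            merged[m] = merged.get(m, 0) + k
    return sorted((k, m) for m, k in merged.items())

def scale(k, bag):
    """The scalar action: k * sum(k_i ⊗ m_i) = sum((k * k_i) ⊗ m_i)."""
    return [(k * ki, mi) for ki, mi in bag]

simplified = normalize([(2, 20), (1, 20), (0, 7), (3, 0), (1, 10)])
doubled = normalize(scale(2, [(1, 10), (1, 10)]))
```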

Lifting homomorphisms. Given a homomorphism of semirings
$h : K \to K'$, and some commutative monoid M, we can
"lift" h to a homomorphism of monoids in a natural way. The
lifted homomorphism is denoted $h^M : K \otimes M \to K' \otimes M$
and defined by:

$$ h^M\left(\sum_i k_i \otimes m_i\right) = \sum_i h(k_i) \otimes m_i $$

3. SIMPLE AGGREGATION QUERIES 

In this section we begin our study of the "provenance- 
aware" evaluation of aggregate queries, focusing on "sim- 
ple" such queries, that is, queries in which aggregations are 
done last; for example, un-nested SELECT FROM WHERE 
GROUP BY queries. This avoids the need to compare values 
which are the result of annotated aggregations and simpli- 
fies the treatment. The restriction is relaxed in the more 
general framework presented in Section 4.

The section is organized as follows. We list the desired 
features of a provenance-aware semantics for aggregation, 
and first try to design a semantics with these features, with- 
out using the tensor product construction, i.e. by simply 



9 The fact that the right algebraic structure to use for aggregates
is that of semimodules can be justified in the same
way in which using semirings was justified in [24]: by showing
how the laws of semimodules follow from desired equivalences
between aggregation queries.



working with K-relations as done in [24]. We show that this
is impossible. Consequently, we turn to semantics that are
based on combining aggregation with values via the tensor
product construction. We propose such semantics that do
satisfy the desired features, first for relational algebra with
an additional AGG operator on relations (that allows aggregation
of all values in a chosen attribute, but no grouping),
and then for GROUP BY queries.

3.1 Semantic desiderata and first attempts 

We next explain the desired features of a provenance- 
aware semantics for aggregation. To illustrate the difficulties 
and the need for a more complex construction, we will first 
attempt to define a semantics on K-relations, without using
the tensor product construction of Section 2.3.

We consider a commutative semiring K (e.g., B, N, N[X], S,
etc.) for tuple annotations and a commutative monoid M
(e.g., SUM = $(\mathbb{R}, +, 0)$, PROD = $(\mathbb{R}, \times, 1)$, MAX =
$(\mathbb{R}^{-\infty}, \max, -\infty)$, MIN = $(\mathbb{R}^{\infty}, \min, \infty)$, etc.) for aggregation.
We will assume that the elements of M already belong
to the database domain, $M \subseteq D$.

We have recalled the semantics of SPJU queries in Section
2.1. Now we wish to add an M-aggregation operation AGG
on relations. We then denote by SPJU-A the restricted class 
of queries consisting of any SPJU-expression followed pos- 
sibly by just one application of AGG. This corresponds to 
SELECT AGG(*) FROM WHERE queries (no grouping). 

For the moment, we do not give a concrete semantics to
$AGG_M(R)$, allowing any possible semantics where the result
of $AGG_M(R)$ is a K-relation. We note that $AGG_M(R)$
should be defined iff R is a K-relation with one attribute
whose values are in M.

What properties do we expect of a reasonable semantics
for SPJU-A (including, of course, a semantics for $AGG_M(R)$)?
A basic sanity check is

Set/Bag Compatibility The semantics coincides with the
usual one when K = B (sets) and M = MAX or MIN,
and when K = N (bags) and M = SUM or PROD.

Note that we associate min and max with sets and sum and 
product with bags. Min and max work fine with bags too, 
but we get the same result if we convert a bag to a set (elimi- 
nate duplicates) and then apply them. Sum and product (in 
the context of other operations such as projection) require 
us to use bag semantics in order to work properly. This
is well-known, but our general approach sheds further light
on the issue by discussing such "compatibility" for arbitrary
semirings and monoids in Section 3.4.

As discussed in the introduction, a fundamental desideratum
with many applications is commutation with homomorphisms.
Note that a semiring homomorphism $h : K \to K'$
naturally extends to a mapping $h_{Rel}$ : K-Rel $\to$ K'-Rel via
$h_{Rel}(R) = h \circ R$ (i.e., the homomorphism is applied to the
annotation of every tuple), which then further extends to
K-databases. With this, the second desired property is

Commutation with Homomorphisms Given any two commutative
semirings K, K' and any homomorphism $h : K \to K'$,
for any query Q, its semantics on K-databases
and on K'-databases satisfy $h_{Rel}(Q(D)) = Q(h_{Rel}(D))$
for any K-database D.

It turns out that this property determines quite precisely the
way in which tuple annotations are defined. We say that the
semantics of an operation $\Omega$ on K-databases is algebraically
uniform with respect to the class of commutative semirings if
the annotations of the output $\Omega(D)$ are defined by the same
(for all K) $\{+_K, \cdot_K, 0_K, 1_K\}$-expressions, where the elements
in the expressions are the annotations of the input D. One
can see that the definition of the SPJU-algebra is indeed
algebraically uniform, and it was shown in [24, 17] to commute
with homomorphisms. The connection between the
two properties is general (proof deferred to the Appendix):

Proposition 3.1. A semantics commutes with homomor- 
phisms iff it is algebraically uniform. 

After stating two of the desired properties, namely set/bag
compatibility and commutation with homomorphisms, we
can already show that it is not possible to give a satisfactory 
semantics to the SPJU-A algebra within the framework used 
in [24] for the SPJU-algebra. 

Proposition 3.2. There is no K-relation semantics for
MAX- (or MIN-) aggregation that is both set-compatible and
commutes with homomorphisms. Similarly, there is no K-relation
semantics for SUM-aggregation that is both bag-compatible
and commutes with homomorphisms.

Proof. Assume by contradiction the existence of such
semantics. Consider the N[X]-relation R with one attribute
and two tuples with values 10 and 20, with the corresponding
tuple annotations being $x, y \in X$. Let R' be $AGG_{MAX}(R)$
according to the assumed semantics; R' is also an N[X]-relation.
Because a tuple t with a value 10 is a possible
answer to the MAX-aggregation (when we set y = 0) it
must occur in supp(R'). Let $p \in \mathbb{N}[X]$ be the annotation
of the tuple t (having value 10) in R'. By algebraic uniformity
the only variables that can occur in p are x and y, and
we consequently denote it p(x, y). Consider two homomorphisms
$h', h'' : \mathbb{N}[X] \to \mathbb{B}$ defined by $h'(x) = h'(y) = \top$ and
by $h''(x) = \top, h''(y) = \bot$. Applying $AGG_{MAX}$ to $h'_{Rel}(R)$
and $h''_{Rel}(R)$ should, by set-compatibility, work as usual.
Hence, by commutation with homomorphisms, $h'(p) = \bot$
and $h''(p) = \top$. Functions on B defined by polynomials in
N[X] are monotone in each variable. But $\bot = h'(p(x, y)) =
p(h'(x), h'(y)) = p(\top, \top)$ and $\top = h''(p(x, y)) = p(h''(x), h''(y)) =
p(\top, \bot)$, in contradiction to the monotonicity.

□ 
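The final step of the proof can also be checked by brute force. The Python sketch below enumerates all boolean functions of two variables and confirms that no monotone one satisfies the two requirements derived in the proof; it is only a sanity check of that step, not a substitute for the argument.

```python
from itertools import product

POINTS = list(product([False, True], repeat=2))

def is_monotone(f):
    """f: dict from (x, y) to bool. Monotone means raising an input from
    False to True never lowers the output."""
    return all(f[a] <= f[b]
               for a in POINTS for b in POINTS
               if a[0] <= b[0] and a[1] <= b[1])

# Enumerate all 16 boolean functions of two variables.
functions = [dict(zip(POINTS, outputs))
             for outputs in product([False, True], repeat=4)]

# The annotation of the value-10 tuple would need p(T, T) = F and p(T, F) = T.
candidates = [f for f in functions
              if is_monotone(f) and not f[(True, True)] and f[(True, False)]]
monotone_count = sum(is_monotone(f) for f in functions)
```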

Alternatively, one may consider going beyond semirings, 
to algebraic structures with additional operations. We have 
briefly explored the use of "negative" information in the in- 
troduction. As we show there, one could use the ring struc- 
ture on Z[X] (the additional subtraction operation) or the 
boolean algebra structure on BoolExp(X) (the additional
complement operation) but the use of negative operation 
does not avoid the need to enumerate in separate tuples of 
the answer all the possible aggregation results given by sub- 
sets of the input. In the case of summation, at least, there 
are exponentially many such tuples. We reject such an ap- 
proach and we state as an additional desideratum: 

Poly-Size Overhead For any query Q and database D, 
the size of Q(D), including annotations, should be only 
polynomial in the size of D. 
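The gap between the two representations can be illustrated numerically. In this Python sketch the salary values are hypothetical and chosen as powers of two so that every non-empty subset has a distinct sum; the tuple-level approach then needs one answer tuple per distinct sum, whereas a formal sum of simple tensors has one term per input tuple.

```python
from itertools import combinations

def distinct_subset_sums(values):
    """All sums obtainable from non-empty subsets of the input values."""
    sums = set()
    for r in range(1, len(values) + 1):
        for subset in combinations(values, r):
            sums.add(sum(subset))
    return sums

salaries = [1, 2, 4, 8, 16]            # hypothetical values chosen so all sums differ
naive_rows = len(distinct_subset_sums(salaries))
tensor_terms = len(salaries)           # one simple tensor per input tuple
```

For the three salaries of the running example (20, 10, 15) the same function yields the seven values 10, 15, 20, 25, 30, 35, and 45.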

We shall next show a semantics for the SPJU-A algebra
that satisfies all three properties we have listed. 

3.2 Annotations ⊗ values and SPJU-A

Let us fix a commutative monoid M (for aggregation) and
a commutative semiring K (for annotation). The inputs
of our queries are as before: K-databases whose domain D
includes the values M over which we aggregate. However,
the outputs are more complicated. The basic idea for the
semantics of aggregation was already shown in Section 2.2,
where it is assumed that the domain of aggregation has a
K-semimodule structure. As we have shown in Section 2.3,
we can give a tensor product construction that embeds M
in the K-semimodule $K \otimes M$ (note that this embedding is
not always faithful, as discussed in Section 3.4).

For the output relations of our algebra queries, we thus
need results of aggregation (i.e., the elements of $K \otimes M$)
to also be part of the domain out of which the tuples are
constructed. Thus for the output domain we will assume
that $K \otimes M \subseteq D$, i.e. the result "combines annotations
with values". The elements of $M$ (e.g., real numbers for
sum or max aggregation) are still present, but only via the
embedding $\iota : M \to K \otimes M$ defined by $\iota(m) = 1_K \otimes m$.

Having annotations from $K$ appear in the values will change
the way in which we apply homomorphisms to query results,
so to emphasize the change we will call $(M, K)$-relations the
$K$-annotated relations whose data domain $D$
includes $K \otimes M$. To summarize, the semantics of the
SPJU-A algebra will map databases of $K$-relations (with
$M \subseteq D$) to $(M, K)$-relations (with $K \otimes M \subseteq D$).

As we define the semantics of the SPJU-A algebra, we
first note that for selection, projection, join and union the
definition is the same as for the SPJU algebra on $K$-databases.
The last step of the query is aggregation, denoted $AGG_M(R)$,
and is well-defined iff $R$ is a $K$-relation with one attribute
whose values are in the $M$ subset of $D$. To apply the definition
that uses the semimodule structure (shown in Section 2.2),
we convert $R$ to an $(M, K)$-relation $\iota(R)$ by replacing
each $m \in M$ with $\iota(m) = 1_K \otimes m \in K \otimes M$. Then, if
$supp(R) = \{m_1, \ldots, m_n\}$ and $R(m_i) = k_i \in K$, $i = 1, \ldots, n$
(i.e., each $m_i$ is annotated with $k_i$) we define $AGG_M(R)$ as
a one-attribute relation with one tuple whose annotation is $1_K$ and
whose content is $SetAgg_{K \otimes M}(\iota(R))$, which is equal to

    $\sum_{i=1}^{n} k_i *_{K \otimes M} \iota(m_i) = k_1 \otimes m_1 +_{K \otimes M} \cdots +_{K \otimes M} k_n \otimes m_n$

We define the annotation of the only tuple in the output of
$AGG_M$ to be $1_K$, since this tuple is always available. However,
the content of this tuple does depend on $R$. For example,
even when $R$ is empty the output is not empty: by the
semimodule laws, its content is $0_{K \otimes M} = \iota(0_M)$.

Commutation with Homomorphisms. We have explained
in Section 2.3 how to lift a homomorphism $h : K \to K'$ to a
homomorphism $h^M : K \otimes M \to K' \otimes M$. Via this we can lift
$h$ to a homomorphism $h_{Rel}$ on $(M, K)$-relations: let $R$ be
such a relation and recall that some values in $R$ are elements
of $K \otimes M$, and the annotations of the tuples are elements
of $K$. Then $h_{Rel}(R)$ denotes the relation obtained from $R$ by
replacing every annotation $k \in K$ with $h(k)$, and additionally replacing
every value $k \otimes m \in K \otimes M$ with $h^M(k \otimes m)$. All other values
in $R$ stay intact. Applying $h_{Rel}$ on an $(M, K)$-database $D$
amounts to applying $h_{Rel}$ on each $(M, K)$-relation in $D$.
We can now state the main result for our SPJU-A algebra:

Theorem 3.3. Let $K, K'$ be semirings, $h : K \to K'$, $Q$
an SPJU-A query and let $M$ be a commutative monoid. For
every $(M, K)$-database $D$, $Q(h_{Rel}(D)) = h_{Rel}(Q(D))$ if and
only if $h$ is a semiring homomorphism.

The proof is by induction on the query structure, and
is straightforward given that for the constructs of SPJU
queries homomorphism commutation was shown in [24], while
commutation for the new $AGG_M$ construct follows directly
from the definition.

Example 3.4. Consider the following $\mathbb{N}[X]$-relation $R$:

    Sal    annotation
    20     $r_1$
    10     $r_2$
    30     $r_3$

Let $M$ be some commutative monoid, then $AGG_M(R)$ consists
of a single tuple with value
$r_1 \otimes 20 +_{K \otimes M} r_2 \otimes 10 +_{K \otimes M} r_3 \otimes 30$.
The intuition is that this value captures multiple
possible aggregation values, each of which may be obtained
by mapping the $r_i$ annotations to $\mathbb{N}$, standing for the
multiplicity of the corresponding tuple. The commutation with
homomorphisms allows us to first evaluate the query and only
then map the $r_i$'s, changing directly the expression in the
query result. For example, if $M = SUM$ and we map $r_1$ to
1, $r_2$ to 0, $r_3$ to 2, we obtain
$1 \otimes 20 +_{K \otimes M} 2 \otimes 30 = 1 \otimes 20 +_{K \otimes M} 1 \otimes 30 +_{K \otimes M} 1 \otimes 30 = 1 \otimes 80$
(which corresponds to the $M$
element 80). As another example, the commutation with
homomorphisms allows us to propagate the deletion of the
first tuple in $R$, by simply setting in the aggregation result
$r_1 = 0$ (keeping the other annotations intact) and obtaining
$2 \otimes 30 = (1 + 1) \otimes 30 = 1 \otimes 30 + 1 \otimes 30 = 1 \otimes (30 + 30) = 1 \otimes 60$.

We further demonstrate an application for security.

Example 3.5. Consider the following relation $R$, annotated
by elements from the security semiring $\mathbb{S}$.

    Sal    annotation
    20     $S$
    10     $1_{\mathbb{S}}$
    30     $S$

Recall (from Section 2.1) the order relation
$1_{\mathbb{S}} < C < S < T < 0_{\mathbb{S}}$; a user with credentials cred can only view tuples
annotated with security level equal to or less than cred. Now
let $M = MAX$ and we obtain:
$AGG_{MAX}(R) = S \otimes 20 +_{K \otimes M} 1_{\mathbb{S}} \otimes 10 +_{K \otimes M} S \otimes 30 = S \otimes (20 +_{MAX} 30) + 1_{\mathbb{S}} \otimes 10$
and we get $S \otimes 30 + 1_{\mathbb{S}} \otimes 10$.

Assume now that we wish to compute the query results as
viewed by a user with security credentials cred. A naive
computation would delete from $R$ all tuples that require higher
credentials, and re-evaluate the query (which in general may
be complex). But observe that the deletion of tuples is equivalent
to applying to $R$ a homomorphism that maps every
annotation $t > cred$ to 0, and $t \le cred$ to 1. Using homomorphism
commutation we can do better by applying this
homomorphism only on the result representation (namely
$S \otimes 30 + 1_{\mathbb{S}} \otimes 10$). For example, for a user with credentials $C$,
we map $S$ to 0 and $1_{\mathbb{S}}$ to 1, and obtain $0 \otimes 30 + 1 \otimes 10 = 1 \otimes 10$;
similarly for a user with credentials $S$ we get
$1 \otimes 30 + 1 \otimes 10 = 1 \otimes (30 +_{MAX} 10) = 1 \otimes 30$.
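
The credential-based view of Example 3.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `LEVELS` and `view` and the string encoding of security levels are hypothetical, and the tensor expression is represented simply as a list of (annotation, value) pairs. After the homomorphism is applied every annotation is 0 or 1, so the result can be read back directly as a MAX value.

```python
# Security levels ordered 1_S < C < S < T < 0_S (hypothetical string encoding).
LEVELS = {"1s": 0, "C": 1, "S": 2, "T": 3, "0s": 4}

def view(expr, cred):
    """Apply the credentials homomorphism to a MAX tensor expression.

    expr is a list of (annotation, value) pairs standing for
    sum_i annotation_i (x) value_i.  Annotations at or below `cred`
    map to 1 (tuple visible), higher ones map to 0 (tuple dropped).
    Returns the MAX of the visible values, or None when nothing is
    visible (the neutral element of the monoid).
    """
    visible = [value for ann, value in expr if LEVELS[ann] <= LEVELS[cred]]
    return max(visible) if visible else None

# S (x) 30 + 1_S (x) 10, the aggregation result of Example 3.5.
result = [("S", 30), ("1s", 10)]
print(view(result, "C"))  # 10
print(view(result, "S"))  # 30
```

Note that the query itself is never re-evaluated: only the stored expression is revisited for each credential level.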

From the above definition of the semantics for aggrega- 
tion, it is obvious that the poly-size overhead property is 
fulfilled. Indeed, consider the case of provenance for summation
as in Example 3.4 and compare it to the naive representation
provided in the Introduction. Instead of having to
list all (exponentially many) options for the sum of salaries, 
we used an expression in K ® SUM that is of linear size with 
respect to the input to the aggregation. As exemplified, the 
possible aggregate answers now correspond to different valu- 
ations for the provenance tokens, applied to this expression. 
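
As a minimal sketch of this point, the following Python fragment stores the linear-size expression of Example 3.4 once and reads off different aggregate answers by valuating the tokens. The function name `evaluate_sum` and the list-of-pairs encoding are assumptions made for illustration; the valuation maps tokens to multiplicities in $\mathbb{N}$, and the resulting element of $\mathbb{N} \otimes SUM$ is read back as an ordinary number.

```python
def evaluate_sum(expr, valuation):
    """Valuate the tokens of a SUM tensor expression in N.

    expr is a list of (token, value) pairs standing for
    sum_i token_i (x) value_i; valuation maps each token to a
    multiplicity.  With annotations in N, n (x) m collapses to
    1 (x) (n * m), so the result is read back as a plain number.
    """
    return sum(valuation[token] * value for token, value in expr)

# r1 (x) 20 + r2 (x) 10 + r3 (x) 30
expr = [("r1", 20), ("r2", 10), ("r3", 30)]

print(evaluate_sum(expr, {"r1": 1, "r2": 0, "r3": 2}))  # 80
print(evaluate_sum(expr, {"r1": 0, "r2": 0, "r3": 2}))  # 60, first tuple deleted
```

The expression has one summand per input tuple, while the set of answers it can produce under different valuations may be exponentially large.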



3.3 Group By 

So far we have considered aggregation in a limited con- 
text, where the input relation contains a single attribute. In 
common cases, however, aggregation is used on arbitrary re- 
lations and in conjunction with grouping, so we next extend 
the algebra to handle such an operation. The idea behind 
the construction is quite simple: we separately group the tu- 
ples according to the values of their "group-by" attributes, 
and the aggregated values for each such group are computed 
similarly to the computation for the AGG operator. When 
considering the annotation of the aggregated tuple, we encounter
a technical difficulty: we want this annotation to
be equal to $1_K$ if the input relation includes at least one tuple
in the corresponding group, and $0_K$ otherwise (for intuition,
consider the case of bag relations, in which the aggregated
result can have at most multiplicity 1); we consequently enrich
our structure to include an additional construct $\delta$ that
will capture that, as follows:

Definition 3.6. A (commutative) $\delta$-semiring is an algebraic
structure $(K, +_K, \cdot_K, 0_K, 1_K, \delta_K)$ where $(K, +_K, \cdot_K, 0_K, 1_K)$
is a commutative semiring and $\delta_K : K \to K$ is a unary operation
satisfying the "$\delta$-laws" $\delta_K(0_K) = 0_K$ and $\delta_K(n 1_K) = 1_K$
for all $n \ge 1$. If $K$ and $K'$ are $\delta$-semirings then a homomorphism
between them is a semiring homomorphism $h : K \to K'$,
for which we have in addition $h(\delta_K(k)) = \delta_{K'}(h(k))$.

The $\delta$-laws completely determine $\delta_{\mathbb{B}}$ and $\delta_{\mathbb{N}}$. But they
leave a lot of freedom for the definition of $\delta_K$ in other semirings;
in particular for the security semiring, a reasonable
choice for $\delta_{\mathbb{S}}$ is the identity function.

As with any equational axiomatization, we can construct
the commutative $\delta$-semiring freely generated by a set $X$,
denoted $\mathbb{N}[X, \delta]$, by taking the quotient of the set of
$\{+, \cdot, 0, 1, \delta\}$-algebraic expressions by the congruence generated
by the equational laws of commutative semirings and
the $\delta$-laws. For example, if $e$ and $e'$ are elements of $\mathbb{N}[X, \delta]$
(i.e., congruence classes of expressions given by some
representatives) then $e +_{\mathbb{N}[X, \delta]} e'$ is the congruence class of the
expression $e + e'$. The elements of $\mathbb{N}[X, \delta]$ are not standard
polynomials but certain subexpressions can be put in polynomial
form, for example $\delta(2 + 3xy^2)$ or $3 + 2\delta(x^2 + 2y)z^2$.

We are now ready to define the group by (denoted GB)
operation; subsequently we exemplify its use, including in
particular the role of $\delta$:

Definition 3.7. Let $R$ be a $K$-relation on set of attributes
$U$, let $U' \subseteq U$ be a subset of attributes that will be grouped
and $U'' \subseteq U$ be the subset of attributes with values in $M$ (to
be aggregated). We assume that $U' \cap U'' = \emptyset$. For a tuple $t$,
we define $T = \{t' \in supp(R) \mid \forall u \in U' \; t'(u) = t(u)\}$.

We then define the aggregation result $R' = GB_{U', U''}(R)$
as follows:

    $R'(t) = \begin{cases} \delta_K\left(\sum_{t' \in T} R(t')\right) & \text{if } T \neq \emptyset \text{ and } \forall u \in U'' \; t(u) = \sum_{t' \in T} R(t') \otimes t'(u) \\ 0_K & \text{otherwise.} \end{cases}$

Example 3.8. Consider the relation $R$:

    Dept     Sal    annotation
    $d_1$    20     $r_1$
    $d_1$    10     $r_2$
    $d_2$    10     $r_3$

and a query $GB_{\{Dept\}, Sal} R$, where the monoid used is SUM.
The result (denoted $R'$) is:

    Dept     Sal                                                   annotation
    $d_1$    $r_1 \otimes 20 +_{K \otimes SUM} r_2 \otimes 10$     $\delta_K(r_1 +_K r_2)$
    $d_2$    $r_3 \otimes 10$                                      $\delta_K(r_3)$

Each aggregated value (for each department) is computed
very similarly to the computation in Example 3.4. Consider
the provenance annotation of the first tuple: intuitively, we
expect it to be $1_K$ if at least one of the first two tuples of $R$
exists, i.e. if at least one out of $r_1$ or $r_2$ is non-zero. Indeed
the expression is $\delta(r_1 +_K r_2)$ and if we map $r_1, r_2$ to e.g. 2
and 1 respectively, we obtain $\delta_{\mathbb{N}}(3) = 1$.
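
To make the role of $\delta$ concrete, the sketch below evaluates the group-by of Example 3.8 after the tokens have been mapped to multiplicities in $\mathbb{N}$. The helper names `delta_n` and `group_by_sum` and the row encoding are hypothetical; the only fact taken from the text is that the $\delta$-laws force $\delta_{\mathbb{N}}(0) = 0$ and $\delta_{\mathbb{N}}(n) = 1$ for $n \ge 1$.

```python
from collections import defaultdict

def delta_n(n):
    """delta on N, as forced by the delta-laws: 0 -> 0, n >= 1 -> 1."""
    return 1 if n >= 1 else 0

def group_by_sum(rows, valuation):
    """Group-by with SUM after valuating tokens in N.

    rows are (dept, sal, token) triples.  For each department the
    function returns (aggregated salary, tuple annotation), where the
    annotation is delta applied to the summed multiplicities.
    """
    total = defaultdict(int)
    multiplicity = defaultdict(int)
    for dept, sal, token in rows:
        k = valuation[token]
        total[dept] += k * sal
        multiplicity[dept] += k
    return {d: (total[d], delta_n(multiplicity[d])) for d in total}

rows = [("d1", 20, "r1"), ("d1", 10, "r2"), ("d2", 10, "r3")]
print(group_by_sum(rows, {"r1": 2, "r2": 1, "r3": 0}))
# {'d1': (50, 1), 'd2': (0, 0)}
```

The group for d1 has multiplicity 3 in the input but annotation 1 in the output, while d2, whose only tuple is mapped to 0, receives annotation 0 and is therefore absent from the support.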

We use SPJU-AGB as the name for relational algebra 
with the two new operators AGG and GB. We note that 
the poly-size overhead property is still fulfilled for queries 
in SPJU-AGB ; commutation with homomorphism also ex- 
tends to SPJU-AGB (see proof in the Appendix). 

Recall that an additional desideratum from the semantics
was bag/set compatibility, where sets and bags are
modeled by $K = \mathbb{B}$ and $K = \mathbb{N}$ respectively. We next study
compatibility in a more general way, for arbitrary $K$ and $M$.

3.4 Annotation-aggregation compatibility 

The first desideratum we listed was an obvious sanity 
check: whatever semantics we define, when specialized to the 
familiar aggregates of max, min and summation, it should 
produce familiar results. Since we had to take an excursion 
through the tensor product $K \otimes M$, this familiarity is not
immediate. However, the following proposition holds (its
correctness will follow from Theorems 3.12 and 3.13):

Proposition 3.9. In the following constructions:
$\mathbb{B} \otimes MAX$, $\mathbb{B} \otimes MIN$, and $\mathbb{N} \otimes SUM$, $\iota : M \to K \otimes M$ where
$\iota(m) = 1_K \otimes m$ is a monoid isomorphism.

This means our semantics satisfies the set/bag compatibility
property because in these cases computing in $K \otimes M$
exactly mirrors computing in $M$.

But of course, we are also interested in working with other
semirings, in particular the provenance semiring, for which
$\mathbb{N}[X] \otimes M$ and $M$ are in general not isomorphic (in particular,
$\iota$ is not surjective and thus not an isomorphism). In fact,
the whole point of working in $\mathbb{N}[X] \otimes MAX$, for example,
is to add annotated aggregate computations to the domain
of values. Most of these do not correspond to actual real
numbers as e.g. $\iota(MAX)$ is a strict subset of $\mathbb{N}[X] \otimes MAX$
(and similarly $\iota(SUM)$ is a strict subset of $\mathbb{N}[X] \otimes SUM$ etc.).
However, when provenance tokens are valuated to obtain set
(or bag) instances, we can go back into $\iota(MAX)$ (or $\iota(SUM)$
etc.), and then we should obtain familiar results by "stripping
off" the $\iota$. It turns out that this works correctly with
$\mathbb{N}[X]$ but not necessarily with arbitrary commutative semirings
$K$. The reason is not only that $\iota$ is not an isomorphism,
but that in general it may be unfaithful (not injective).
Indeed $\iota : SUM \to \mathbb{B} \otimes SUM$ is not injective:

    $\iota(4) = \iota(2 + 2) = \iota(2) +_{K \otimes M} \iota(2) = \top \otimes 2 +_{K \otimes M} \top \otimes 2 = (\top \vee \top) \otimes 2 = \top \otimes 2 = \iota(2)$

This is not surprising, as it is related to the well-known 
difficulty of making summation work properly with set se- 
mantics. In general, we thus define compatibility as follows: 

Definition 3.10. We say that a commutative semiring
$K$ and a commutative monoid $M$ are compatible if $\iota$ is injective.



The point of the definition is that when there is compatibility,
we can work in $K \otimes M$ and whenever the results
belong to $\iota(M)$, we can safely read them as familiar answers
from $M$. We give three theorems that capture some general
conditions for compatibility.

First, we note that if we work with a semiring in which
$+_K$ is idempotent, such as $\mathbb{B}$ or $\mathbb{S}$, a compatible monoid must
also be idempotent (e.g. MIN or MAX but not SUM):

Proposition 3.11. Let $K$ be some commutative semiring
such that $+_K$ is idempotent, and let $M$ be some commutative
monoid. If $M$ is compatible with $K$, then $+_M$ is idempotent.

Proof. $\iota(m) = 1_K \otimes m = (1_K +_K 1_K) \otimes m = 1_K \otimes m +_{K \otimes M} 1_K \otimes m = 1_K \otimes (m +_M m) = \iota(m +_M m)$,
and since $\iota$ is injective, $m = m +_M m$. □

Nicely enough, idempotent aggregations are compatible
with every annotation semiring $K$ that is positive with respect
to $+_K$. $K$ is said to be positive with respect to $+_K$
if $k +_K k' = 0_K \Rightarrow k = k' = 0_K$. For instance, $\mathbb{B}$, $\mathbb{N}$, $\mathbb{S}$
and $\mathbb{N}[X]$ are such semirings (but not $(\mathbb{Z}, +, \cdot, 0, 1)$). The
following theorem holds:

Theorem 3.12. If $M$ is a commutative monoid such that
$+_M$ is idempotent, then $M$ is compatible with any commutative
semiring $K$ which is positive with respect to $+_K$.

Proof sketch. We define $h : K \otimes M \to M$ as
$h\left(\sum_{i \in I} k_i \otimes m_i\right) = \sum_{j \in J} m_j$ where $J = \{j \in I \mid k_j \neq 0_K\}$.
We can show that $h$ is well-defined (details deferred to the
Appendix); since $\forall m \in M \; h \circ \iota(m) = m$, $\iota$ is injective and
thus $K$ and $M$ are compatible. □

For general (and in particular non-idempotent) monoids 
(e.g. SUM ) we identify a sufficient condition on K (which 
in particular holds for N[X]), that allows for compatibility: 

Theorem 3.13. Let $K$ be a commutative semiring. If
there exists a semiring homomorphism from $K$ to $\mathbb{N}$ then
$K$ is compatible with all commutative monoids.

Proof sketch. Let $h'$ be a homomorphism from $K$ to
$\mathbb{N}$, and $M$ be an arbitrary commutative monoid. We define
a mapping $h : K \otimes M \to M$ by $h\left(\sum k_i \otimes m_i\right) = \sum h'(k_i) m_i$.
We show in the Appendix that $h$ is well-defined and that
$h \circ \iota$ is the identity function, hence $\iota$ is injective. □

Corollary 3.14. The semiring of provenance polynomials
$\mathbb{N}[X]$ is compatible with all commutative monoids.

Now consider the security semiring $\mathbb{S}$. It is idempotent,
and therefore not compatible with non-idempotent monoids
such as SUM. Still, we want to be able to use $\mathbb{S}$ and other
idempotent semirings, while allowing the evaluation of
aggregation queries with non-idempotent aggregates. This
would work if we could construct annotations that would
allow us to use Theorem 3.13; in other words, if we could
combine annotations from $\mathbb{S}$ with multiplicity annotations
(i.e. annotations from $\mathbb{N}$). We explain next the construction
of such a semiring $\mathbb{SN}$ (for security-bag), and its compatibility
with any commutative monoid $M$ will follow from the
existence of a homomorphism $h : \mathbb{SN} \to \mathbb{N}$.

Constructing a compatible semiring. We start with the
semiring of polynomials $\mathbb{N}[\mathbb{S}]$, i.e. polynomials where instead
of indeterminates (variables) we have members of $\mathbb{S}$, and the
coefficients are natural numbers. Already $\mathbb{N}[\mathbb{S}]$ is compatible
with any commutative monoid $M$, as there exists a homomorphism
$h : \mathbb{N}[\mathbb{S}] \to \mathbb{N}$; but if we work with $\mathbb{N}[\mathbb{S}]$ we lose the
ability to use the identities that hold in $\mathbb{S}$ and to thus reduce
the size of annotations in query results. We can do better
by taking the quotient of $\mathbb{N}[\mathbb{S}]$ by the smallest congruence
containing the following identities:

• $\forall s_1, s_2 \in \mathbb{S} \quad s_1 > s_2 \Rightarrow s_1 \cdot_{\mathbb{N}[\mathbb{S}]} s_2 = s_1$.

• $\forall c \in \mathbb{N}, s \in \mathbb{S} \quad 0_{\mathbb{S}} \cdot_{\mathbb{N}[\mathbb{S}]} s = c \cdot_{\mathbb{N}[\mathbb{S}]} 0_{\mathbb{S}} = 0$.

• $\forall c \in \mathbb{N} \quad c \cdot_{\mathbb{N}[\mathbb{S}]} 1_{\mathbb{S}} = c$.

We will denote the resulting quotient semiring by $\mathbb{SN}$. It
is easy to check that the faithfulness of the embeddings of
$\mathbb{N}$ and $\mathbb{S}$ in $\mathbb{N}[\mathbb{S}]$ is preserved by taking the quotient. Most
importantly, there is still a homomorphism from $\mathbb{SN}$ to $\mathbb{N}$. Thus,

Corollary 3.15. $\mathbb{SN}$ is compatible with any commutative
monoid $M$.

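Example 3.16. Consider the SUM monoid. Let $R, S$ be
the following $\mathbb{S}$-relations which by the embedding of $\mathbb{S}$ we take
as $\mathbb{SN}$-relations:

[Tables of $R$ and $S$ over the single attribute $A$; only the values 30 and 10 and fragments of the annotation column survive in this copy.]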

Consider the query: $AGG(R \cup \Pi_{S.A}(S \bowtie R))$. Ignoring
the annotations, the expected result (under bag semantics)
is 70. Working within the (compatible) semantics defined by
$\mathbb{SN} \otimes SUM$, the query result contains an aggregated value of
$(T \cdot_{\mathbb{SN}} S +_{\mathbb{SN}} S) \otimes 30 + S \otimes 10$. We can further simplify this to
$T \otimes 30 + S \otimes 30 + S \otimes 10 = T \otimes 30 + S \otimes 40$. This means that
e.g. for a user with credentials $T$ the query result is $1_{\mathbb{SN}} \otimes 70$,
and we can use the inverse of $\iota$ to map it to $\mathbb{N}$ and obtain
70. Similarly, for a user with credentials $S$, the query result
is mapped to 40. These are indeed the expected results.

Note that if we had used $\mathbb{S}$ instead of $\mathbb{SN}$ in the above example,
we would have $(T +_{\mathbb{S}} S) = S$ so $(T +_{\mathbb{S}} S) \otimes 30$
would be the same as $S \otimes 30$. For a user with credentials $T$
we could either use this, leading to the result of $1_{\mathbb{S}} \otimes 40$, or
use the same computation done in the example, to obtain
$1_{\mathbb{S}} \otimes 70$. Indeed, in $\mathbb{S} \otimes SUM$, we have $1_{\mathbb{S}} \otimes 40 = 1_{\mathbb{S}} \otimes 70$. This
is the same phenomenon demonstrated in the beginning of
this subsection for $\mathbb{B}$, where $\iota$ is not injective, preventing us
from stripping it away.

Note also that if we had used $\mathbb{N}[\mathbb{S}]$ instead of $\mathbb{SN}$
then we could not have done the illustrated simplifications.
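
The computation in Example 3.16 can be replayed with a small Python model of $\mathbb{SN}$. This is a sketch under assumptions, not the construction itself: elements are stored as dictionaries from a security level to a natural coefficient (using the quotient identities, every monomial reduces to a single level because the product of two levels is the higher one), and all names (`sn`, `add`, `mul`, `to_n`, `view_sum`) are hypothetical.

```python
ORDER = ["1s", "C", "S", "T", "0s"]          # 1_S < C < S < T < 0_S
RANK = {level: i for i, level in enumerate(ORDER)}

def sn(level):
    """Embed a security level into SN; 0_S becomes the zero polynomial."""
    return {} if level == "0s" else {level: 1}

def add(a, b):
    """Addition in SN: coefficients of equal levels are summed."""
    out = dict(a)
    for level, coeff in b.items():
        out[level] = out.get(level, 0) + coeff
    return out

def mul(a, b):
    """Multiplication in SN: levels combine to the higher one."""
    out = {}
    for l1, c1 in a.items():
        for l2, c2 in b.items():
            level = max(l1, l2, key=RANK.get)
            out[level] = out.get(level, 0) + c1 * c2
    return out

def to_n(a, cred):
    """Homomorphism SN -> N for a user: visible levels map to 1, others to 0."""
    return sum(c for level, c in a.items() if RANK[level] <= RANK[cred])

def view_sum(expr, cred):
    """Read a SUM tensor expression over SN as a number for credentials cred."""
    return sum(to_n(k, cred) * m for k, m in expr)

T, S = sn("T"), sn("S")
# (T * S + S) (x) 30 + S (x) 10
expr = [(add(mul(T, S), S), 30), (S, 10)]
print(view_sum(expr, "T"))  # 70
print(view_sum(expr, "S"))  # 40
```

Because coefficients are kept, the two contributions of value 30 are counted separately, which is exactly what is lost when the annotations are taken from $\mathbb{S}$ alone.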

4. NESTED AGGREGATION QUERIES 

So far we have studied only queries where the aggregation 
operator is the last one performed. In this section we extend 
the discussion to queries that involve comparisons on aggre- 
gate values. We first demonstrate the difficulties that arise 
in designing an algebra for such queries, then explain how 
to extend the construction to overcome these difficulties. 

Note. For simplicity, all results and examples are presented
for queries in which the comparison operator is equality (=).
However the results can easily be extended to arbitrary comparison
predicates that can be decided for elements of $M$.

4.1 Difficulties 

We start by exemplifying where the algebra proposed for
restricted aggregation queries fails here:



Example 4.1. Reconsider the relation (denoted $R'$) which
is the result of the aggregation query depicted in Example 3.8.
Further consider a query $Q_{select}$ that selects from $R'$ all tuples
for which the aggregated salary equals 20. The crux
is that deciding the truth value of the selection condition
involves interpreting the comparison operator on symbolic
representations of values in $R'$; so far, we have no way of
interpreting the obtained comparison expression, for instance
$r_1 \otimes 20 + r_2 \otimes 10$ "equals" 20, and thus we cannot decide the
existence of tuples in the selection result.

Note that in the above example, the truth value of the
comparison (and consequently the set of tuples in the query
result) depends in a non-monotonic way on the existence of
tuples in the (original) input relation $R$: note that if we map
$r_1$ to 1 and $r_2$ to 0 then the tuple with dept. $d_1$ appears in
the query result, but if we map both to 1, it does not. The
challenge that this non-monotonicity poses is fundamental,
and is encountered by any algebra on $(M, K)$-relations. The
following proposition, which is the counterpart of Proposition
3.2, holds (proof deferred to the Appendix):

Proposition 4.2. There is no (M, K) -relation seman- 
tics for nested aggregation queries with MAX-(or MIN-)- 
aggregation that is both set-compatible and commutes with 
homomorphisms. Similarly for SUM-aggregation and bag- 
compatibility. 

Consequently, a more intricate construction is required for 
nested aggregation queries. 

4.2 An Extended Structure 

We start with an example of our treatment of nested ag- 
gregation queries, then give the formal construction. 

Example 4.3. Reconsider Example 4.1 and recall that
the challenge in query evaluation lies in comparing elements
of $K \otimes M$ with elements of $M$ (or $K \otimes M$, e.g. in case of
joins). Our solution is to introduce to the semiring $K$ new
elements, of the form $[x = y]$ where $x, y \in K \otimes M$ (if we need
to compare with $m \in M$, we use $\iota(m)$ instead). The result
of evaluating the query in Example 4.1 (using $M = SUM$)
will then be captured by:

    Dept     Sal                                       annotation
    $d_1$    $r_1 \otimes 20 + r_2 \otimes 10$         $\delta(r_1 +_K r_2) \cdot_K [r_1 \otimes 20 +_{K \otimes M} r_2 \otimes 10 = 1_K \otimes 20]$
    $d_2$    $r_3 \otimes 10$                          $\delta(r_3) \cdot_K [r_3 \otimes 10 = 1_K \otimes 20]$

Intuitively, since we do not know which tuples will satisfy
the selection criterion, we keep both tuples and multiply the
provenance annotation of each of them by a symbolic equality
expression. These equality expressions are kept as symbols
until we can embed the values in $M = SUM$ and decide
the equality (e.g. if $K = \mathbb{N}$), in which case we "replace"
it by $1_K$ if it holds or $0_K$ otherwise. For example, given a
homomorphism $h : \mathbb{N}[X] \to \mathbb{N}$, $h(r_1) = h(r_2) = 1$, then
$h^M(r_1 \otimes 20 +_{K \otimes M} r_2 \otimes 10) = h(r_1) \otimes 20 + h(r_2) \otimes 10 = 1 \otimes 30 \neq 1 \otimes 20$,
thus the equality expression is replaced with
(i.e. mapped by the homomorphism to) $0_K$.

We next define the construction formally; the idea underlying
the construction is to define a semiring whose elements
are polynomials, in which equation elements are additional
indeterminates. To achieve that, we introduce for
any semiring $K$ and any commutative monoid $M$, the "domain"
equation $\hat{K} = \mathbb{N}[K \cup \{[c_1 = c_2] \mid c_1, c_2 \in \hat{K} \otimes M\}]$.
The right-hand side is monotone, in fact continuous, w.r.t.
the usual set inclusion order, hence this equation has a
set-theoretic least solution (no need for order-theoretic domain
theory). The solution also has an obvious commutative
semiring structure induced by that of polynomials. The solution
semiring is $\hat{K} = (X, +_{\hat{K}}, \cdot_{\hat{K}}, 0_{\hat{K}}, 1_{\hat{K}})$, and we continue
by taking the quotient on $\hat{K}$ defined by the following axioms.
For all $k_1, k_2 \in K$, $c_1, c_2, c_3, c_4 \in \hat{K} \otimes M$:

    $k_1 +_{\hat{K}} k_2 \sim k_1 +_K k_2$
    $k_1 \cdot_{\hat{K}} k_2 \sim k_1 \cdot_K k_2$
    $[c_1 = c_3] \sim [c_2 = c_4]$   (if $c_1 =_{\hat{K} \otimes M} c_2$, $c_3 =_{\hat{K} \otimes M} c_4$)

and if $K$ and $M$ are such that $\iota$ defined by $\iota(m) = 1_K \otimes m$
is an isomorphism (and let $h$ be its inverse), we further take
the quotient defined by: for all $a, b \in K \otimes M$,

    (*)  $[a = b] \sim 1_K$   (if $h(a) =_M h(b)$)
         $[a = b] \sim 0_K$   (if $h(a) \neq_M h(b)$)

We use $K^M$ to denote the semiring obtained by applying
the above construction on a semiring $K$ and a commutative
monoid $M$. A key property is that, when we are able to
interpret the equalities in $M$, $K^M$ collapses to $K$. Formally,

Proposition 4.4. If $K$ and $M$ are such that $K \otimes M$ and
$M$ are isomorphic via $\iota$ then $K^M = K$.

The proof (deferred to the Appendix) is by induction on
the structure of elements in $K^M$, showing that at each step
we can "solve" an equality sub-expression, and replace it with
$0_K$ or $1_K$.

Lifting homomorphisms. To conclude the description of
the construction we explain how to lift a semiring homomorphism
from $h : K \to K'$ to $h^M : K^M \to K'^M$, for any
commutative monoid $M$ and semirings $K, K'$. $h^M$ is defined
recursively on the structure of $a \in K^M$: if $a \in K$ we
define $h^M(a) = h(a)$, otherwise $a = [b \otimes m_1 = c \otimes m_2]$ for
some $b, c \in K^M$ and $m_1, m_2 \in M$ and we define
$h^M(a) = [h^M(b) \otimes m_1 = h^M(c) \otimes m_2]$. Note that the application of
a homomorphism $h^M$ maps equality expressions to equality
expressions (in which elements of $K'$ appear where elements
of $K$ appeared before). If $K'$ and $M$ are such that
their corresponding $\iota : M \to K' \otimes M$ defined by
$\iota(m) = 1_{K'} \otimes m$ is injective, then we may "resolve the equalities",
otherwise the (new) equality expression remains.

4.3 The Extended Semantics 

The extended semiring construction allows us to design a 
semantics for general aggregation queries. Intuitively, when 
the existence of a tuple in the result relies on the result of 
a comparison involving aggregate values (as in the result of 
applying selection or joins), we multiply the tuple annota- 
tion by the corresponding equation annotation. 

In the sequel we assume, to simplify the definition, that
the query aggregates and compares only values of $K^M \otimes M$
(a value $m \in M$ is first replaced by $\iota(m) = 1_K \otimes m$). In what
follows, let $R$ ($R_1$, $R_2$) be $(M, K^M)$-relations on an attribute
set $U$. Recall that for a tuple $t$, $t(u)$ (where $u \in U$) is the
value of the attribute $u$ in $t$; also for $U' \subseteq U$, recall that we
use $t|_{U'}$ to denote the restriction of $t$ to the attributes in $U'$.
Last, we use $(K^M \otimes M)^U$ to denote the set of all tuples on
attribute set $U$, with values from $K^M \otimes M$. The semantics
follows:

1. empty relation: $\forall t \; \emptyset(t) = 0$.

2. union: $(R_1 \cup R_2)(t) =$

    $\begin{cases} \sum_{t' \in supp(R_1)} R_1(t') \cdot \prod_{u \in U} [t'(u) = t(u)] + \sum_{t' \in supp(R_2)} R_2(t') \cdot \prod_{u \in U} [t'(u) = t(u)] & \text{if } t \in supp(R_1) \cup supp(R_2) \\ 0 & \text{otherwise.} \end{cases}$

3. projection: Let $U' \subseteq U$, and let $T = \{t|_{U'} \mid t \in supp(R)\}$.
Then $\Pi_{U'}(R)(t) =$

    $\begin{cases} \sum_{t' \in supp(R)} R(t') \cdot \prod_{u \in U'} [t(u) = t'(u)] & \text{if } t \in T \\ 0 & \text{otherwise.} \end{cases}$

4. selection: If $P$ is an equality predicate involving the
equation of some attribute $u \in U$ and a value $m \in M$
then $(\sigma_P(R))(t) = R(t) \cdot [t(u) = \iota(m)]$.

5. value based join: We assume for simplicity that $R_1$ and
$R_2$ have disjoint sets of attributes, $U_1$ and $U_2$ resp., and
that the join is based on comparing a single attribute
of each relation. Let $u_1 \in U_1$ and $u_2 \in U_2$ be the
attributes to join on. For every $t \in (K^M \otimes M)^{U_1 \cup U_2}$:

    $(R_1 \bowtie_{R_1.u_1 = R_2.u_2} R_2)(t) = R_1(t|_{U_1}) \cdot R_2(t|_{U_2}) \cdot_K [t(u_1) = t(u_2)]$.

Simple Variants. Natural join (when $U_1$ and $U_2$ are
not necessarily disjoint) is captured by a similar expression,
with the equality sub-expression on the attributes
common to $U_1$ and $U_2$; join on multiple values
is captured by multiplication by the corresponding
multiple equality expressions; in the representation of
cartesian product (denoted by $\times$) no equality expressions
appear (only $R_1(t|_{U_1}) \cdot R_2(t|_{U_2})$).

6. Aggregation: $AGG_M(R)(t) =$

    $\begin{cases} 1 & \text{if } t(u) = \sum_{t' \in supp(R)} R(t') *_{K \otimes M} t'(u) \\ 0 & \text{otherwise.} \end{cases}$

7. Group By: Let $U' \subseteq U$ be a subset of attributes that
will be grouped and $u \in U \setminus U'$ be the aggregated attribute.
Then for every $t \in (K^M \otimes M)^{U' \cup \{u\}}$, $GB_{U', u} R(t) =$

    $\begin{cases} \delta((\Pi_{U'} R)(t|_{U'})) & \text{if } t(u) = \sum_{t' \in supp(R)} \left(R(t') \cdot_K \prod_{u' \in U'} [t'(u') = t(u')]\right) *_{K \otimes M} t'(u) \\ 0 & \text{otherwise.} \end{cases}$



It is straightforward to show that the algebra satisfies 
set/bag compatibility and poly-size overhead; commutation 
with homomorphism is proved in the Appendix. 



Example 4.5. Reconsider the relation in Example 4.3
and let us perform another sum aggregation on Sal. The
value in the result now contains equation expressions:

    $\delta(r_1 +_K r_2) \cdot_K [r_1 \otimes 20 +_{K \otimes M} r_2 \otimes 10 = 1_K \otimes 20] *_{K \otimes M} (r_1 \otimes 20 +_{K \otimes M} r_2 \otimes 10)$
    $+_{K \otimes M} \; \delta(r_3) \cdot_K [r_3 \otimes 10 = 1_K \otimes 20] *_{K \otimes M} r_3 \otimes 10$

Given a homomorphism $h : \mathbb{N}[X] \to \mathbb{N}$ we can "solve"
the equations, e.g. if $h(r_1) = 1$, $h(r_2) = 0$ and $h(r_3) = 2$,
we obtain an aggregated value of $1 \otimes 40$. Note that the
aggregation value is not monotone in $r_1, r_2, r_3$: map $r_2$ to 1
(and keep $r_1, r_3$ as before), to obtain $1 \otimes 20$.
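
The two valuations of Example 4.5 can be checked mechanically. The sketch below hard-codes the expression of the example (departments $d_1$ and $d_2$, selection constant 20) and resolves each equality token only after the tokens $r_i$ have been mapped to $\mathbb{N}$; the function names `delta` and `nested_sum` and the dictionary encoding of the valuation are assumptions made for illustration.

```python
def delta(n):
    """delta on N: 0 -> 0, any n >= 1 -> 1."""
    return 1 if n >= 1 else 0

def nested_sum(h, target=20):
    """Evaluate the nested aggregation of Example 4.5 under valuation h.

    h maps the tokens r1, r2, r3 to multiplicities in N.  Each
    department contributes its aggregated salary only when its
    equality token [aggregated salary = target] resolves to 1.
    """
    sal_d1 = h["r1"] * 20 + h["r2"] * 10      # r1 (x) 20 + r2 (x) 10
    sal_d2 = h["r3"] * 10                     # r3 (x) 10
    ann_d1 = delta(h["r1"] + h["r2"]) * (1 if sal_d1 == target else 0)
    ann_d2 = delta(h["r3"]) * (1 if sal_d2 == target else 0)
    return ann_d1 * sal_d1 + ann_d2 * sal_d2

print(nested_sum({"r1": 1, "r2": 0, "r3": 2}))  # 40
print(nested_sum({"r1": 1, "r2": 1, "r3": 2}))  # 20
```

Raising the multiplicity of $r_2$ from 0 to 1 lowers the result from 40 to 20, which is the non-monotone behaviour noted in the example.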

5. DIFFERENCE 

We next show that via our semantics for aggregation, we 
can obtain for the first time a semantics for arbitrary queries 
with difference on $K$-relations. We describe the obtained
semantics and study some of its properties. 

5.1 Semantics for Difference 

We first note that difference queries may be encoded as
queries with aggregation, using the monoid $\mathbb{B} = (\{\bot, \top\}, \vee, \bot)$
(the following encoding was inspired by [29, 9]):

    $R - S = \Pi_{a_1 \ldots a_n}\{(GB_{\{a_1, \ldots, a_n\}, b}(R \times \bot_b \cup S \times \top_b)) \bowtie_{a_1, \ldots, a_n, b} (R \times \bot_b)\}$.

$\bot_b$ and $\top_b$ are relations on a single attribute $b$, containing
a single tuple $(\bot)$ and $(\top)$ respectively, with provenance $1_K$.
Using the semantics of Section 4 we obtain a semantics for
the difference operation.

Interestingly, we next show that the obtained semantics
can be captured by a simple and intuitive expression. First,
we note that since the monoid $\mathbb{B}$ is idempotent, every semiring $K$ positive
with respect to $+_K$ is compatible with it (see Theorem
3.12). The following proposition then holds for every $K, K'$
and every two $(\mathbb{B}, K)$-relations $R, S$ (proof deferred to the
Appendix):

Proposition 5.1. For every tuple $t$, semirings $K, K'$ such
that $K' \otimes \mathbb{B}$ is isomorphic to $\mathbb{B}$ via $\iota(m) = 1_{K'} \otimes m$, if
$h : K \to K'$ is a semiring homomorphism then:
$h^M([(R - S)(t)]) = h^M([S(t) \otimes \top = 0] \cdot_K R(t))$.

The obtained provenance expression is thus "equivalent"
(in the precise sense of Proposition 5.1) to
$[S(t) \otimes \top = 0] \cdot R(t)$. The following lemma helps us to understand the
meaning of the obtained equality expression:

Lemma 5.2. For every semiring $K$ which is positive w.r.t.
$+_K$ and $h : K \to \mathbb{B}$, $h^M([S(t) \otimes \top = 0]) = \top$ iff $h(S(t)) = \bot$.

Proof. It is clear that if $h(S(t)) = \bot$, then
$h^M([S(t) \otimes \top = 0]) = [h(S(t)) \otimes \top = 0] = [\bot \otimes \top = 0] = \top$.
For the other direction, assume that $h(S(t)) = \top$. Thus
$[h(S(t)) \otimes \top = 0] = [\top \otimes \top = 0]$. Since the semiring $\mathbb{B}$ and
the monoid $\mathbb{B}$ are compatible, $\iota : \mathbb{B} \to \mathbb{B} \otimes \mathbb{B}$ is
injective; thus $\iota(\top) \neq \iota(\bot)$; consequently
$h^M([S(t) \otimes \top = 0]) = [\top \otimes \top = 0] = \bot$. □

Consequently, the semantics can be interpreted as follows: 
a tuple t appears in the result of R — S if it appears in R, 
but does not appear in S. When the tuple appears in the 
result of R — S, it carries its original annotation from R. I.e. 
the existence of t in S is used as a boolean condition. 

Example 5.3. Let $R, S$ be the following relations, where
$R$ contains employees and their departments and $S$ contains
departments that are designated to be closed:

    R:   ID    Dep      annotation
         1     $d_1$    $t_1$
         2     $d_1$    $t_2$
         2     $d_2$    $t_3$

    S:   Dep      annotation
         $d_1$    $t_4$

To obtain a relation with all departments that remain
active, we can use the query $(\Pi_{Dep} R) - S$, resulting in:

    Dep      annotation
    $d_1$    $[t_4 \otimes \top = 0] \cdot (t_1 + t_2)$
    $d_2$    $[0 = 0] \cdot t_3 \; (= t_3)$

Now consider some homomorphism $h : \mathbb{N}[X] \to \mathbb{N}$ (multiplicity
e.g. stands for number of employees in the department).
Note that if $h(t_4) > 0$ then the department $d_1$ is
closed and indeed $d_1$ is omitted from the support of the difference
query result, otherwise it retains each original annotation
that it had in $R$. Assume now that we decide to revoke
the decision of closing the department $d_1$. This corresponds
to mapping $t_4$ to 0; we can easily propagate this deletion to
the query results; the equality appearing in the annotation of
the first tuple is now $[0 = 0] = 1_K$ and we obtain as expected:

    Dep      annotation
    $d_1$    $t_1 + t_2$
    $d_2$    $t_3$

In particular, we obtain a semantics for the entire Re- 
lational Algebra, including difference. It is interesting to 
study the specialization of the obtained semantics for par- 
ticular semirings: B, N, Z, and to compare it to previously 
studied semantics for difference. 
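
For $K = \mathbb{N}$ the expression $[S(t) \otimes \top = 0] \cdot R(t)$ can be evaluated directly once the tokens are valuated: the equality token is 1 exactly when the multiplicity of $t$ in $S$ is 0. The following Python sketch replays Example 5.3 on that basis; the function name `difference` and the encoding of an annotation as a list of tokens (standing for their sum) are assumptions for illustration.

```python
def difference(R, S, h):
    """Difference R - S under a valuation h of the tokens into N.

    R and S map each tuple to a list of tokens whose sum is the
    tuple's annotation.  The result annotation is R(t) when the
    valuated S(t) is 0, and 0 otherwise, mirroring
    [S(t) (x) T = 0] * R(t).  Tuples with annotation 0 are omitted.
    """
    out = {}
    for t, tokens in R.items():
        r = sum(h[x] for x in tokens)
        s = sum(h[x] for x in S.get(t, []))
        annotation = r if s == 0 else 0
        if annotation:
            out[t] = annotation
    return out

proj_R = {"d1": ["t1", "t2"], "d2": ["t3"]}   # projection of R on Dep
S = {"d1": ["t4"]}

print(difference(proj_R, S, {"t1": 1, "t2": 1, "t3": 1, "t4": 1}))  # {'d2': 1}
print(difference(proj_R, S, {"t1": 1, "t2": 1, "t3": 1, "t4": 0}))  # {'d1': 2, 'd2': 1}
```

Setting $t_4$ to 0 restores $d_1$ with its full annotation from the projection, matching the second table of Example 5.3.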

5.2 Comparison with other semantics 

For a semiring $K$ and a commutative monoid $M$ we say
that two queries $Q, Q'$ are equivalent if for every input $(M, K)$-database
$D$, the results (including annotations) $Q(D)$ and
$Q'(D)$ are congruent (namely the corresponding values and
annotations are congruent) according to the axioms of $K^M \otimes M$
and $K^M$. In the sequel we fix $M = \mathbb{B}$ and consider different
instances of $K$, exemplifying different equivalence axioms
for queries with difference while comparing them with
previously suggested semantics. We use $Q \equiv_K Q'$ to denote
the equivalence of $Q, Q'$ with respect to $K$ and $\mathbb{B}$.

$\mathbb{B}$-relations. For $K = \mathbb{B}$, our semantics is the same as set semantics, thus the following proposition holds:

Proposition 5.4. For $Q, Q' \in RA$ it holds that $Q \equiv_{\mathbb{B}} Q'$ if and only if $Q \equiv Q'$ under set semantics.

$\mathbb{N}$-relations. For $K = \mathbb{N}$, we compare our semantics to bag equivalence and observe that they are different (for queries with difference, even without aggregation). Intuitively this is because in our semantics, the right-hand side of the difference is treated as a boolean condition, rather than having the effect of decreasing the multiplicity. Formally,

Proposition 5.5. $Q \equiv_{\mathbb{N}} Q'$ does not imply that $Q \equiv Q'$ under bag semantics, and vice versa.

Proof. Observe that $A - (B \cup B) \equiv_{\mathbb{N}} A - B$; but this does not hold for bag semantics. In contrast, under bag semantics $(A \cup B) - B \equiv A$, but not for our semantics. □

Example 5.6. Reconsider Example 5.3 and let $t_1 = t_2 = t_3 = t_4 = 1$. Under bag semantics, after projecting R on the department attribute, the multiplicity of the department $d_1$ becomes 2; after applying the difference the department $d_1$ is still in the result, but now with multiplicity 1; in contrast under our semantics the department $d_1$ does not appear in the support of the result.
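
The contrast between the two semantics for $\mathbb{N}$-relations can be checked on small inputs. The sketch below is illustrative only: it models bags with Python's `collections.Counter`, uses Counter subtraction for bag difference, and models the semantics of this paper by a hypothetical helper `cond_difference` that keeps a tuple of the left operand exactly when it is absent from the right operand.

```python
from collections import Counter


def bag_union(a: Counter, b: Counter) -> Counter:
    return a + b  # multiplicities add up


def bag_difference(a: Counter, b: Counter) -> Counter:
    return a - b  # multiplicities are decreased; non-positive ones are dropped


def cond_difference(a: Counter, b: Counter) -> Counter:
    # Right-hand side acts as a boolean condition: keep t iff b[t] == 0.
    return Counter({t: n for t, n in a.items() if n > 0 and b[t] == 0})


A = Counter({"x": 2})
B = Counter({"x": 1})

# A - (B u B) versus A - B (proof of Proposition 5.5).
ours_equal = cond_difference(A, bag_union(B, B)) == cond_difference(A, B)
bags_equal = bag_difference(A, bag_union(B, B)) == bag_difference(A, B)

# (A u B) - B versus A.
bags_absorb = bag_difference(bag_union(A, B), B) == A
ours_absorb = cond_difference(bag_union(A, B), B) == A

# Example 5.6: projection of R on Dep, then difference with S.
proj_R = Counter({"d1": 2, "d2": 1})
S = Counter({"d1": 1})
bag_result = bag_difference(proj_R, S)
our_result = cond_difference(proj_R, S)
```

On this data the first equivalence holds only for the conditional difference and the second only for bag difference, and in Example 5.6 department d1 keeps multiplicity 1 under bag semantics while it disappears under the conditional one.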



$\mathbb{Z}$-relations. Finally, in [22] the authors have presented $\mathbb{Z}$ semantics for difference, and have shown that it leads to equivalence axioms that are different from those that hold for queries with bag difference. It is also different from the equivalence axioms that we have here for $\mathbb{Z}$-relations:

Proposition 5.7. $Q \equiv_{\mathbb{Z}} Q'$ does not imply $Q \equiv Q'$ under $\mathbb{Z}$ semantics (footnote 10), and vice versa.

Proof. Under $\mathbb{Z}$ semantics it was shown in [22] that $(A - (B - C)) \equiv (A \cup C) - B$. This does not hold for our semantics. In contrast $A - (B \cup B) \equiv_{\mathbb{Z}} A - B$, but this equivalence does not hold under $\mathbb{Z}$ semantics. □

Deciding Query Equivalence. We conclude with a note on the decidability of equivalence of queries using our semantics. It turns out that for semirings such as $\mathbb{B}$, $\mathbb{N}$ for which we can interpret the results in $\mathbb{B}$ (in the sense of Proposition 5.1 above), query equivalence is undecidable.

Proposition 5.8. Let K be such that $K^{\mathbb{B}} \otimes \mathbb{B}$ is isomorphic to $\mathbb{B}$. Equivalence of Relational Algebra queries on K-relations is undecidable.

Proof. The proof is by reduction from equivalence under set semantics: let $\phi$ be the empty query, i.e. a query whose answer is always the empty relation. Given two RA queries Q, Q' (note that Q and Q' can include difference), their equivalence under set semantics holds if and only if $Q - Q' \equiv_K \phi$ and $Q' - Q \equiv_K \phi$. □

6. RELATED WORK 

Provenance information has been extensively studied in 
the database literature. Different provenance management 
techniques are introduced in [141 [3 El E], etc., and it was 
shown in [24, 21] that these approaches can be compared in
the semiring framework. To our knowledge, this work is the 
first to study aggregate queries in the context of provenance 
semirings. Provenance information has a variety of applica- 
tions (see introduction) and we believe that our novel frame- 
work for aggregate queries will benefit all of these. Specif- 
ically, queries with aggregation play a key role in modeling 
the operational logic of scientific workflows (see e.g. [5, 16]),
and our framework is likely to facilitate a more fine-grained 
approach to workflow provenance. 

Aggregate queries have been extensively studied in e.g. [12, 13] for bag and set semantics. As explained in [12], such queries are fundamental in many applications: OLAP queries, mobile computing, the analysis of streaming data, etc. We note that monoids are used to capture general aggregation operators in [13], but our paper seems to be the first to study their interaction with provenance.

Several semantics of difference on relations with annota- 
tions have been proposed, starting with the c-tables of [28] . 
The semirings with monus of [19] generalize this as well 
as bag-semantics. Difference on relations with annotations 
from Z are considered in [22] and from Z[X] in [20]. As ex- 
plained in Section 5, the semantics for difference defined in 
this paper is different from all of these. 

There are interesting connections between provenance management and query evaluation on uncertain (and probabilistic) databases (e.g. [30, 15, 6, 3]), as observed in [24]. Evaluation of aggregate queries on probabilistic databases has been studied in e.g. [36, 33]. Trying to optimize the performance of aggregate query evaluation on probabilistic databases via provenance management is an intriguing future research challenge.
10 As defined in [22]. 



7. CONCLUSION 

We have studied in this paper provenance information for 
queries with aggregation in the semiring framework. We 
have identified three desiderata for the assessment of candidate
approaches: compatibility with the usual set/bag semantics,
commutation with semiring homomorphisms and
poly-size overhead. After showing that approaches using 
provenance only to annotate the database tuples do not sat- 
isfy all desiderata simultaneously, we considered a different 
framework in which the computation of aggregate values is 
itself annotated with provenance. This has led us to the 
algebraic structure of semimodules over commutative semirings
of annotations and to a tensor product construction for
the semantics of annotated aggregation. The first product 
of this approach is a "good" (i.e. satisfying the desiderata) 
semantics for SPJU queries followed by an aggregation or 
a group-by with aggregation. We have further studied the 
challenges that arise in evaluation of queries that apply com- 
parisons on aggregation results, e.g., joins over aggregate 
values, and shown that by careful adaptation of the semi- 
module framework these challenges can be overcome with 
a semantics that satisfies the desiderata. Finally, we noted 
that difference queries may be encoded as queries with ag- 
gregation, and studied the algebra induced for such queries. 

We have exemplified in the paper the application of our 
approach for deletion propagation and security annotations. 
As mentioned in the Introduction and Related Work sec- 
tions, there are various other areas in which provenance is 
useful. Future research will focus on applying our framework 
to the research tasks tackled in these areas. 
APPENDIX

A. SPJU ALGEBRA FOR K-RELATIONS 

We recall the full definition of the SPJU algebra for K-relations from [24]:

Empty Relation $\forall t \;\; \phi(t) = 0$.

Selection If $R : \mathbb{D}^U \to K$ and the selection predicate P maps each U-tuple to either 0 or 1 then $\sigma_P R : \mathbb{D}^U \to K$ is defined by $(\sigma_P R)(t) = R(t) \cdot P(t)$.

Projection If $R : \mathbb{D}^U \to K$ and $U' \subseteq U$ then $\Pi_{U'} R : \mathbb{D}^{U'} \to K$ is defined by $(\Pi_{U'} R)(t) = \sum R(t')$ where the sum is over all $t' \in supp(R)$ such that $t'|_{U'} = t$.

Natural Join If $R_i : \mathbb{D}^{U_i} \to K$, $i = 1, 2$ then $R_1 \bowtie R_2 : \mathbb{D}^{U_1 \cup U_2} \to K$ is defined by $(R_1 \bowtie R_2)(t) = R_1(t_1) \cdot_K R_2(t_2)$ where $t_i = t|_{U_i}$, $i = 1, 2$.

Union If $R_i : \mathbb{D}^U \to K$, $i = 1, 2$ then $R_1 \cup_K R_2 : \mathbb{D}^U \to K$ is defined by $(R_1 \cup_K R_2)(t) = R_1(t) +_K R_2(t)$.
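
The definitions above translate almost directly into code. The following is a minimal sketch under stated assumptions, not an implementation from [24]: a K-relation is modelled as a Python dict from rows to annotations, a row is a sorted tuple of (attribute, value) pairs, and the `Semiring` record together with all function names is hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]


def row(**values):
    """Build a row as a sorted tuple of (attribute, value) pairs."""
    return tuple(sorted(values.items()))


def _put(K, rel, r, annot):
    """Add annot to the annotation of r, keeping only the support."""
    total = K.add(rel.get(r, K.zero), annot)
    if total == K.zero:
        rel.pop(r, None)
    else:
        rel[r] = total


def select(K, R, pred):
    out = {}
    for r, k in R.items():  # (sigma_P R)(t) = R(t) * P(t)
        _put(K, out, r, K.mul(k, K.one if pred(dict(r)) else K.zero))
    return out


def project(K, R, attrs):
    out = {}
    for r, k in R.items():  # sum over all t' whose restriction equals t
        _put(K, out, tuple((a, v) for a, v in r if a in attrs), k)
    return out


def join(K, R1, R2):
    out = {}
    for r1, k1 in R1.items():
        for r2, k2 in R2.items():
            d1, d2 = dict(r1), dict(r2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                _put(K, out, tuple(sorted({**d1, **d2}.items())), K.mul(k1, k2))
    return out


def union(K, R1, R2):
    out = dict(R1)
    for r, k in R2.items():  # (R1 u R2)(t) = R1(t) + R2(t)
        _put(K, out, r, k)
    return out


NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)

R = {row(ID=1, Dep="d1"): 1, row(ID=2, Dep="d1"): 1, row(ID=2, Dep="d2"): 1}
deps_bag = project(NAT, R, {"Dep"})
deps_set = project(BOOL, {r: True for r in R}, {"Dep"})
joined = join(NAT, R, {row(Dep="d1"): 3})
```

Instantiating the same operators with `NAT` gives bag semantics and with `BOOL` gives set semantics, which is the specialization behaviour the semiring framework is designed around.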

B. PROPERTIES OF $K \otimes M$

We show here that the $K \otimes M$ constructed in Section 2.3 forms a K-semimodule, and highlight some of its basic properties.

Proposition B.1. $K \otimes M$ is a K-semimodule.

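Proof. We show that the six semimodule axioms (Definition 2.1) hold. Four of them hold already for bags of simple tensors. For example

$$(k \cdot_K k') *_{K \otimes M} \sum k_i \otimes m_i = \sum (k \cdot_K k' \cdot_K k_i) \otimes m_i = k *_{K \otimes M} \sum (k' \cdot_K k_i) \otimes m_i = k *_{K \otimes M} \left(k' *_{K \otimes M} \sum k_i \otimes m_i\right)$$

By taking the quotient by the congruence defined in Section 2.2 we also get the remaining two axioms:

$$(k +_K k') *_{K \otimes M} \sum k_i \otimes m_i = \sum (k \cdot_K k_i +_K k' \cdot_K k_i) \otimes m_i \sim \sum \left((k \cdot_K k_i) \otimes m_i +_{K \otimes M} (k' \cdot_K k_i) \otimes m_i\right)$$
$$= \sum (k \cdot_K k_i) \otimes m_i +_{K \otimes M} \sum (k' \cdot_K k_i) \otimes m_i = k *_{K \otimes M} \sum k_i \otimes m_i +_{K \otimes M} k' *_{K \otimes M} \sum k_i \otimes m_i$$

and

$$0_K *_{K \otimes M} \sum k_i \otimes m_i = \sum 0_K \otimes m_i \sim \sum 0_{K \otimes M} = 0_{K \otimes M}$$

This concludes the proof. □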

Furthermore, $K \otimes M$ is the "most economical" K-semimodule in the following sense. Define $\iota : M \to K \otimes M$ such that $\iota(m) = 1_K \otimes m$. Every tensor is a linear (with respect to K) combination of simple tensors from $\iota(M)$. More precisely,

Proposition B.2. $K \otimes M$ satisfies the "universality" property, i.e. for any K-semimodule W and any homomorphism of monoids $f : M \to W$ there exists a unique homomorphism of K-semimodules $f^* : K \otimes M \to W$ such that $f^* \circ \iota = f$.

Proof. Define $f^*$ first on bags of simple tensors as follows

$$f^*\left(\sum k_i \otimes m_i\right) = \sum k_i *_W f(m_i)$$

where the second sum is taken with $+_W$ (in particular, the empty such sum is by convention $0_W$). Thus $f^*(\iota(m)) = f^*(1_K \otimes m) = 1_K *_W f(m) = f(m)$. Then, one can check that $f^*$ is a homomorphism with respect to $+_{K \otimes M}$, $0_{K \otimes M}$ and $*_{K \otimes M}$. This implies that for $f^*$ to preserve $\sim$ it suffices to preserve the four laws of the congruence given in Section 2.2, which is readily checked, for example

$$f^*(k \otimes m +_{K \otimes M} k \otimes m') = k *_W f(m) +_W k *_W f(m') = k *_W (f(m) +_W f(m')) = k *_W f(m +_M m') = f^*(k \otimes (m +_M m'))$$

Since $f^*$ preserves $\sim$ it can be defined as above by picking a representative from each equivalence class. Now let $g : K \otimes M \to W$ be another linear function such that $g \circ \iota = f$. Then

$$g\left(\sum k_i \otimes m_i\right) = g\left(\sum k_i *_{K \otimes M} (1_K \otimes m_i)\right) = \sum k_i *_W g(1_K \otimes m_i) = \sum k_i *_W f(m_i) = f^*\left(\sum k_i \otimes m_i\right)$$

hence $g = f^*$, thus verifying the uniqueness of $f^*$. In particular, any linear function on $K \otimes M$ is completely determined by its behavior on the tensors in $\iota(M)$. □
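
The construction of $f^*$ in the proof can be illustrated on a concrete instance. This is only a sketch under assumptions that are not in the text: K is taken to be the natural numbers, a tensor is represented by one arbitrary representative (a list of simple tensors `(k, m)`), and the target W is the monoid (N, +, 0) viewed as a semimodule with scaling by multiplication. The names `extend`, `f_star` and `iota` are hypothetical.

```python
def extend(f, w_zero, w_add, w_scale):
    """Return f*: sum_i k_i (x) m_i  |->  sum_i k_i *_W f(m_i)."""
    def f_star(tensor):
        total = w_zero  # the empty sum is 0_W by convention
        for k, m in tensor:
            total = w_add(total, w_scale(k, f(m)))
        return total
    return f_star


def iota(m):
    """Embedding of M into K (x) M: m |-> 1_K (x) m."""
    return [(1, m)]


# Assumed instance: W = (N, +, 0), k *_W w = k * w, and f = identity.
f_star = extend(lambda m: m, 0, lambda a, b: a + b, lambda k, w: k * w)

# f* o iota = f
round_trip = f_star(iota(7))

# Congruent representatives are mapped to the same value:
# k (x) m + k (x) m'  ~  k (x) (m + m')
left = f_star([(3, 10), (3, 5)])
right = f_star([(3, 15)])
```

Because `f_star` returns the same value on congruent representatives, it is well defined on equivalence classes, which is the point of the well-definedness check in the proof.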

We also say that $K \otimes M$ is the K-semimodule freely generated by M (thus the "most economical" appellation).

Recall that we have defined in Section 2.3 the "lifting" of a homomorphism of semirings $h : K \to K'$ to a homomorphism of monoids $h^M : K \otimes M \to K' \otimes M$. Its definition is an immediate consequence of Proposition B.2: indeed $K' \otimes M$ becomes a K-semimodule via h so we can define $h^M$ as the unique homomorphism of K-semimodules that by the proposition extends $\iota : M \to K' \otimes M$. Note that this yields the definition in Section 2.3:

$$h^M\left(\sum k_i \otimes m_i\right) = \sum h(k_i) \otimes m_i$$
C. ADDITIONAL PROOFS 

Proof. (Proposition 3.1)

The "if" direction follows from the fact that homomorphisms, by definition, preserve $\{+, \cdot, 0, 1\}$-expressions. For the "only if" direction we use abstractly tagged [24] databases. These are $\mathbb{N}[X]$-databases in which each tuple is annotated by just an indeterminate in X, and a different one at that. It is as if tuples are annotated by their distinct id. Clearly, there is a canonical way of choosing a large enough X and producing a canonical abstractly tagged database that is completely determined by its support. Now, fix an operation $\Omega$, and consider its semantics on K-databases, for various K. For any K and any K-database D, let $D^a$ be the abstractly tagged database determined by supp(D). Let $h : \mathbb{N}[X] \to K$ be the homomorphism uniquely determined by mapping the abstract tags in X to the actual annotations of D. Fix a tuple t in $supp(\Omega(D^a))$ and let $p_t = \Omega(D^a)(t) \in \mathbb{N}[X]$ be the polynomial that annotates it in $\Omega(D^a)$. By commutation with homomorphisms and the definition of $h_{Rel}$ we have

$$\Omega(D)(t) = \Omega(h_{Rel}(D^a))(t) = h_{Rel}(\Omega(D^a))(t) = h(p_t).$$

Using the homomorphism properties we move h inside $p_t$ until it applies to just indeterminates. For an indeterminate, say x, h(x) is the K-annotation of the unique tuple in supp(D) that is annotated with x in $D^a$. It follows that $\Omega(D)(t)$ is given by the $\{+, \cdot, 0, 1\}$-expression $p_t$ in terms of the annotations of D. But $p_t$ only depends on supp(D) and $\Omega$ while it is the same for all K. This shows the algebraic uniformity of the semantics. □



Commutation with homomorphism for SPJU-AGB queries.

We next prove that the semantics proposed for restricted ag- 
gregation queries (in Section 3) satisfies commutation with 
homomorphisms: 

Proof. The proof is by induction on the query structure, but since commutation with homomorphisms was already shown for SPJU queries, we need only to prove that GB commutes with homomorphisms as well. Let R be a K-relation on the set of attributes U, where $U', U'' \subseteq U$ and $U' \cap U'' = \emptyset$. R may be the result of applying any sequence of SPJU operations that appear in the query Q, followed by $GB_{U',U''}(R)$.

Consider the result when first applying the GB operation followed by $h_{Rel}$. According to Definition 3.7 the result of applying GB will be a relation R', whose support contains every tuple t such that:

1. t is defined on the attributes $U' \cup U''$;

2. For some non-empty subset $T = \{t'_1, \ldots, t'_n\}$ of supp(R), the restriction of t to attributes of U' is equal to the restriction of every tuple $t'_i \in T$ to U', and not equal to the restriction to U' of any other tuple in $supp(R) - T$;

3. For each $u \in U''$, $t(u) = \sum_{t'_i \in T} R(t'_i) \otimes t'_i(u)$; and

4. $R'(t) = \delta\left(\sum_{t'_i \in T} R(t'_i)\right)$.

The effect of applying $h_{Rel}$ on such t would then be:

1. $(h_{Rel}(R'))(t)$ is equal by definition to $h(R'(t)) = h\left(\delta\left(\sum_{t'_i \in T} R(t'_i)\right)\right) = \delta\left(\sum_{t'_i \in T} h(R(t'_i))\right)$.

2. For every $u \in U''$, t(u) will be replaced by $h^M(t(u)) = h^M\left(\sum_{t'_i \in T} R(t'_i) \otimes t'_i(u)\right) = \sum_{t'_i \in T} h(R(t'_i)) \otimes t'_i(u)$.

3. The rest of the values in t remain intact.

Then, let us check the result of applying $h_{Rel}$ before the GB operation, and compare it to the above result. Applying $h_{Rel}$ on R will only affect the tuple provenance annotations; for every tuple t', $(h_{Rel}(R))(t') = h(R(t'))$. Now let us apply $GB_{U',U''}(h_{Rel}(R))$. Again, according to our semantics, the result will be a relation R'', whose support contains every tuple t such that:

1. t is defined on the attributes $U' \cup U''$ (as before);

2. For some non-empty subset $T = \{t'_1, \ldots, t'_n\}$ of $supp(h_{Rel}(R))$, the restriction of t to attributes of U' is equal to the restriction of every tuple $t'_i \in T$ to U', and not equal to the restriction to U' of any other tuple in $supp(h_{Rel}(R)) - T$;

3. For each $u \in U''$, $t(u) = \sum_{t'_i \in T} (h_{Rel}(R)(t'_i)) \otimes t'_i(u) = \sum_{t'_i \in T} h(R(t'_i)) \otimes t'_i(u)$; and

4. $R''(t) = \delta\left(\sum_{t'_i \in T} h(R(t'_i))\right)$.

We now need to verify that these results are indeed equivalent. Note first that applying $h_{Rel}$ on R before applying the GB operation only affects the tuple annotations, and not their values. We then employ a "by-case" analysis to verify equivalence. For any tuple t that is both in the support of R and in the support of $h_{Rel}(R)$, it is easy to observe from the above equations that the "contribution" of t to both the aggregation value and its annotation is the same in both cases. Consequently we can focus on tuples in supp(R) for which $h_{Rel}$ sets their annotations to 0, thus deleting them from their groups or even deleting a whole group by deleting all its members. Those tuples and groups contribute to the result in R', when applying GB first (before applying $h_{Rel}$). However, this means that the summands corresponding to the annotations of those tuples in the $\delta$ annotation of the groups in R' will later be set to 0 by $h_{Rel}$; as for the aggregation results, for every tuple t' that is deleted by h, its summand $h(R(t')) \otimes t'(u)$ in each aggregation result will be set to $0 \otimes t'(u) = 0_{K \otimes M}$ and thus it will have no effect on the aggregation results. We thus conclude that if no group has been deleted altogether, the results are equivalent. The last case to consider is that where all the annotations of tuples in group T are set to zero. In this case, its $\delta$ expression will be equal to zero as well, so the group is effectively deleted also by $h_{Rel}$ after the GB.
This concludes the proof. □
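
The two evaluation orders compared in this proof can be tried on a toy instance. The sketch below rests on several assumptions that are not taken from the paper: K is the product semiring N x N with componentwise operations, h is the projection onto the first component (so it can map a non-zero annotation to 0), $\delta$ is taken to be the componentwise non-zero indicator, and the aggregation monoid is SUM with tensors over N interpreted as $\sum k_i \cdot m_i$. Annotations in N are written as 1-tuples so that the same helpers serve both semirings; all names are hypothetical.

```python
from collections import defaultdict


def add(k1, k2):
    return tuple(a + b for a, b in zip(k1, k2))


def is_zero(k):
    return all(c == 0 for c in k)


def delta(k):
    # Assumed delta: componentwise indicator of "non-zero".
    return tuple(1 if c > 0 else 0 for c in k)


def h(k):
    # Projection N x N -> N (as a 1-tuple); a semiring homomorphism.
    return (k[0],)


def group_by(R):
    """R maps (group, value) -> annotation; returns group -> (tensor, annotation)."""
    groups = defaultdict(list)
    for (g, m), k in R.items():
        groups[g].append((k, m))  # simple tensor k (x) m
    out = {}
    for g, tensor in groups.items():
        total = tensor[0][0]
        for k, _ in tensor[1:]:
            total = add(total, k)
        out[g] = (tensor, delta(total))
    return out


def h_rel_before(R):
    """Apply h to an input relation, dropping tuples annotated with 0."""
    return {t: h(k) for t, k in R.items() if not is_zero(h(k))}


def h_rel_after(G):
    """Apply h to a group-by result: annotations and tensor coefficients."""
    return {g: ([(h(k), m) for k, m in tensor], h(annot))
            for g, (tensor, annot) in G.items() if not is_zero(h(annot))}


def interpret(G):
    """Read tensors over N with the SUM monoid as numbers: sum_i k_i * m_i."""
    return {g: (sum(k[0] * m for k, m in tensor), annot[0])
            for g, (tensor, annot) in G.items()}


R = {("d1", 20): (1, 1), ("d1", 10): (0, 2), ("d2", 15): (0, 1)}
gb_then_h = interpret(h_rel_after(group_by(R)))
h_then_gb = interpret(group_by(h_rel_before(R)))
```

Applying GB first leaves a summand with coefficient 0 in the tensor for d1 and a group d2 whose annotation becomes 0, whereas applying h first removes those tuples up front; after interpretation both orders yield the same relation.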

Proof. (Thm. 3.12)

Let K be a commutative semiring which is positive with respect to $+_K$ and define $h : K \otimes M \to M$ as $h\left(\sum_{i \in I} k_i \otimes m_i\right) = \sum_{j \in J} m_j$ where $J = \{j \in I \mid k_j \neq 0\}$. We can show that h is well-defined (see below); since $\forall m \in M \;\; h \circ \iota(m) = m$, $\iota$ is injective and thus K and M are compatible.

We need to verify that h is a well-defined mapping, and for that we check that it is well-defined on $K \otimes M$ after taking the quotient (as defined in Section 2.2):

• (For $k, k' \neq 0_K$) $h((k +_K k') \otimes m) = m = m +_M m = h(k \otimes m +_{K \otimes M} k' \otimes m)$.

• $h(0_K \otimes m) = 0_M$, and also the empty bag is mapped to the "empty sum" i.e. $h(0_{K \otimes M}) = 0_M$.

• (For $k \neq 0_K$) $h(k \otimes (m +_M m')) = m +_M m' = h(k \otimes m +_{K \otimes M} k \otimes m')$.

• $h(k \otimes 0_M) = 0_M$, and again $h(0_{K \otimes M}) = 0_M$.

Note that we assumed that k and k' are non-zero in the first and third axioms. Since K is positive with respect to $+_K$, no such k, k' can satisfy $k +_K k' = 0_K$, thus the case of $0_K \otimes m$ is uniquely defined to be mapped to $0_M$, by the second axiom.

This concludes the proof. □
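
The four checks in this proof can be run mechanically on a small instance. The sketch assumes K is the natural numbers (positive with respect to +) and M is the idempotent monoid (N, max, 0); a tensor is again represented by one of its representatives, a list of simple tensors `(k, m)`. These choices and the function name `h` mirror the proof but are otherwise illustrative.

```python
def h(tensor):
    """h(sum_i k_i (x) m_i) = sum over {j : k_j != 0} of m_j, with + = max."""
    result = 0  # neutral element of (N, max, 0); value of the empty sum
    for k, m in tensor:
        if k != 0:
            result = max(result, m)
    return result


small = range(0, 4)

# Law 1: (k + k') (x) m  ~  k (x) m + k' (x) m
law1 = all(h([(k + k2, m)]) == h([(k, m), (k2, m)])
           for k in small for k2 in small for m in small)

# Law 2: 0_K (x) m  ~  empty bag
law2 = all(h([(0, m)]) == h([]) for m in small)

# Law 3: k (x) (m + m')  ~  k (x) m + k (x) m', where + on M is max
law3 = all(h([(k, max(m, m2))]) == h([(k, m), (k, m2)])
           for k in small for m in small for m2 in small)

# Law 4: k (x) 0_M  ~  empty bag
law4 = all(h([(k, 0)]) == h([]) for k in small)

# h o iota is the identity on M, so iota is injective
identity = all(h([(1, m)]) == m for m in small)
```

Law 1 relies on both idempotence of max and positivity of K: a zero coefficient contributes nothing on either side, and two non-zero coefficients for the same m collapse to m.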

Proof. (Thm. 3.13)

Let h' be a homomorphism from K to $\mathbb{N}$, and M be an arbitrary commutative monoid. We define a mapping $h : K \otimes M \to M$ as follows:

$$h\left(\sum k_i \otimes m_i\right) = \sum h'(k_i) \, m_i.$$

We show that h is well-defined and that $h \circ \iota$ is the identity function.

We first show that this mapping is well-defined, i.e. that every pair of elements from $K \otimes M$ which are equated by the axioms of the tensor construction (as defined in Section 2.2) are mapped by h to the same values.

1st axiom. Left side: $h((k +_K k') \otimes m)$ is equal by the definition of h to $h'(k +_K k') \, m$. Since h' is a homomorphism, this is equal to $(h'(k) + h'(k')) \, m$. Right side: $h(k \otimes m +_{K \otimes M} k' \otimes m) = h'(k) \, m +_M h'(k') \, m$ by the definition of h. Since $h'(k), h'(k')$ are natural numbers, this is equal to the result of the left hand side.

2nd axiom. Left side: $h(0_K \otimes m) = h'(0_K) \, m$. Since h' is a homomorphism, $h'(0_K) = 0$ and thus the expression is equal to $0 \, m = 0_M$. Right side: by definition of h, the "empty" sum in $K \otimes M$ must be mapped to the "empty" sum in M, which is $0_M$.

3rd axiom. Left side: $h(k \otimes (m +_M m')) = h'(k)(m +_M m') = h'(k) \, m +_M h'(k) \, m'$. Right side: $h(k \otimes m +_{K \otimes M} k \otimes m') = h'(k) \, m +_M h'(k) \, m'$.

4th axiom. Left side: $h(k \otimes 0_M) = h'(k) \, 0_M = 0_M$. Right side: same as the 2nd axiom.

Since h is well-defined such that $h(a + b) = h(a) + h(b)$, it is a homomorphism from $K \otimes M$ to M. Now we need to show that $h \circ \iota$ is the identity function, implying that $\iota$ is injective and thus that M and K are compatible. This is easy: since h' is a homomorphism it must map $1_K$ to 1; then for every $m \in M$, $h(\iota(m)) = h(1_K \otimes m) = h'(1_K) \, m = 1 \, m = m$. □

Proof. (Proposition 4.2)

We show the proof for SUM and the proof for MAX (MIN) is similar. Reconsider the relation R' and the query $Q_{select}$ in Example 4.1 and assume that $R'' = Q_{select}(R')$ is a $(M, \mathbb{N}[X])$-relation capturing the query result according to some algebra. Assume by contradiction that the algebra commutes with homomorphisms. Let h, h' be homomorphisms from $\mathbb{N}[X]$ to $\mathbb{N}$ corresponding to those in Example 3.2, i.e. $h(r_1) = h(r_3) = h'(r_1) = h'(r_2) = h'(r_3) = 1$ and $h(r_2) = 0$. We saw in Example 4.1 that the aggregation result when h is applied is 20; thus, in order to be bag-compatible, $h(R'')$ must include a tuple t'' representing this tuple which matched the selection condition. Let $p_{t''} \in \mathbb{N}[X]$ be its provenance annotation, then $h(p_{t''}) = 1$. However, $h'(R'')$ is empty, since no aggregation result is equal to 20 in that case. Thus $h'(p_{t''}) = 0$. Similarly to the proof of Prop. 3.2 observe that there exists no such polynomial $p_{t''} \in \mathbb{N}[X]$. □

Proof. (Theorem 4.4)

The proof is by induction on the structure of elements in $K^M$. We say that an expression $exp \in K^M$ has a nesting level of 0 if for every expression $[c_1 \otimes m_1 = c_2 \otimes m_2]$ appearing in exp, $c_1, c_2 \in K$ (and $m_1, m_2 \in M$); exp has a nesting level n if each such $c_1, c_2$ are of nesting level $n - 1$ or less. For nesting level of 0, axiom (*) above allows us to replace $[c_1 \otimes m_1 = c_2 \otimes m_2]$ with $1_K$ or $0_K$. Now, assume that the theorem holds for expressions with nesting level $n - 1$ or less, and let exp be of nesting level n. Then for each sub-expression $[c_1 \otimes m_1 = c_2 \otimes m_2]$, we can replace, by the induction hypothesis (and using the axioms above), $c_1, c_2$ with elements of K and then apply axiom (*) to replace the equality expression with an element of K. We can repeat for every equality sub-expression of exp, obtaining an element of K. □
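
The induction on nesting level corresponds to a recursive evaluator that resolves the innermost equality tokens first. The sketch below is an illustration under assumptions not fixed by the text: K is the natural numbers, M is (N, +, 0), a tensor over these is compared through its numeric value $\sum c_i \cdot m_i$, and expressions are encoded as nested tuples with the hypothetical tags `"const"`, `"add"`, `"mul"` and `"eq"`.

```python
def evaluate(expr):
    """Evaluate an expression with nested equality tokens to an element of N."""
    kind = expr[0]
    if kind == "const":
        return expr[1]
    if kind == "add":
        return evaluate(expr[1]) + evaluate(expr[2])
    if kind == "mul":
        return evaluate(expr[1]) * evaluate(expr[2])
    if kind == "eq":
        # [tensor1 = tensor2]: coefficients are evaluated first (induction
        # hypothesis), then the token is replaced by 1 or 0 (axiom (*)).
        left = tensor_value(expr[1])
        right = tensor_value(expr[2])
        return 1 if left == right else 0
    raise ValueError(f"unknown expression kind: {kind}")


def tensor_value(tensor):
    """Value of sum_i c_i (x) m_i, assuming tensors are read as sum_i c_i * m_i."""
    return sum(evaluate(c) * m for c, m in tensor)


one, two = ("const", 1), ("const", 2)

# Nesting level 0: [2 (x) 10 = 1 (x) 20]
inner = ("eq", [(two, 10)], [(one, 20)])

# Nesting level 1: the coefficient of the left tensor is itself an equality token.
outer = ("eq", [(inner, 5)], [(one, 5)])

# A product whose equality token is false.
dropped = ("mul", ("const", 3), ("eq", [(one, 10)], [(one, 20)]))
```

Each recursive call on a coefficient handles an expression of strictly lower nesting level, so the recursion terminates in the same way the induction in the proof does.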

Commutation with homomorphism for the extended se- 
mantics. We next prove that the semantics proposed for 
nested aggregation queries (in Section 4) satisfies commuta- 
tion with homomorphisms: 

Proof. The proof is by induction on the query structure. For each operation we consider two cases: (I) applying the homomorphism $h_{Rel}$ after applying the operation; (II) applying it before the operation. Both cases must yield equal results.

In what follows we use the same notations, $R$, $R_1$, $R_2$ and so on, as used in the definition of the extended semantics.

Union. First, consider case (I), where the union is applied first. According to the defined semantics, the result of $R_1 \cup R_2$ is a $(M, K^M)$-relation such that for every tuple t in its support it holds that:

1. t is defined (only) on the attributes in U.

2. $t \in supp(R_1) \cup supp(R_2)$

3. $(R_1 \cup R_2)(t) = \sum_{t' \in supp(R_1)} R_1(t') \cdot_K \prod_{u \in U} [t'(u) = t(u)] \; +_K \sum_{t' \in supp(R_2)} R_2(t') \cdot_K \prod_{u \in U} [t'(u) = t(u)]$.

Then, applying $h_{Rel}$ on $R_1 \cup R_2$ has the following effect on that t.

1. For every value in t from $K^M \otimes M$, its value changes from $\sum k_i \otimes m_i$ to $h^M\left(\sum k_i \otimes m_i\right) = \sum h(k_i) \otimes m_i$.

2. The provenance annotation of t is changed according to the axioms of the homomorphism lifting, to

$h_{Rel}(R_1 \cup R_2)(t) = \sum_{t' \in supp(R_1)} h^M(R_1(t')) \cdot \prod_{u \in U} [h^M(t'(u)) = h^M(t(u))] \; + \sum_{t' \in supp(R_2)} h^M(R_2(t')) \cdot \prod_{u \in U} [h^M(t'(u)) = h^M(t(u))]$.

Now, for case (II), let us apply $h_{Rel}$ first, on both $R_1$ and $R_2$. This would affect only the provenance annotations of tuples within these relations, perhaps causing the deletion of some tuples, and the values from $K^M \otimes M$, which change in the same manner as described in item (1) above. Let us compute the result of $h_{Rel}(R_1) \cup h_{Rel}(R_2)$. Every tuple t in the support of the obtained relation is such that:

1. t is defined (only) on the attributes in U.

2. $t \in supp(R_1) \cup supp(R_2)$ and it holds that either $h(R_1(t)) \neq 0_{K'}$ or $h(R_2(t)) \neq 0_{K'}$.

3. $(h_{Rel}(R_1) \cup h_{Rel}(R_2))(t) = \sum_{t' \in supp(R_1)} h^M(R_1(t')) \cdot \prod_{u \in U} [h^M(t'(u)) = h^M(t(u))] \; + \sum_{t' \in supp(R_2)} h^M(R_2(t')) \cdot \prod_{u \in U} [h^M(t'(u)) = h^M(t(u))]$.

We next verify that the results are indeed equal. For every tuple t there are several options: it can be in the support of $R_1$, $R_2$, neither or both; and h can set its provenance to 0 in none of them, one of them or both. Any tuple t which is not in $supp(R_1)$ or $supp(R_2)$ clearly does not affect the result. Any tuple which is in at least one of them will be annotated in case (I) with a sum of each annotation of each tuple t' in $supp(R_1) \cup supp(R_2)$, multiplied by tokens that equate each value from $K^M \otimes M$ in t to the value of the same attribute in t'. In the worst case, where all the values are from $K^M \otimes M$, we do not know which tuples are equal and thus we compare each pair on each attribute. Then applying h might cause some of the original tuple annotations, hence some of the summands in the provenance of t, to become 0. In case (II) those tuples for which h sets the annotations to 0 are removed from $R_1$, $R_2$ or both, and thus they do not appear in the sum to begin with. For the tuples that remain it is clear that the obtained annotations and $K^M \otimes M$ values are the same in both case (I) and case (II).

One thing to note here is that different tuples in a relation, $R_1$ for instance, may be equated after applying $h_{Rel}$. This is true, for instance, when we have two tuples which differ only by some aggregation result, but after applying h those results turn out to be the same. This works well with the homomorphism commutation as well, because it is easy to see that in both cases the tuples will be equal, in case (I) after applying h on the union result and in case (II) before the union is applied.

The proof for projection is very similar to the one for union, thus it is not repeated here.

Selection. According to the algebra definition, to get $(\sigma_P(R))(t)$ we simply multiply the annotation of each tuple $t \in supp(R)$ by an expression equating the value of the relevant attribute u in t to some value m (embedded into $K \otimes M$ using $\iota$).

In case (I) the provenance of some tuple t in the support of $\sigma_P(R)$ might be $R(t) \cdot_K [(\sum k_i \otimes m_i) = 1_K \otimes m]$, which would become, after applying h,
$h(R(t)) \cdot_{K'} [(\sum h(k_i) \otimes m_i) = 1_{K'} \otimes m]$.

In case (II) the annotation of tuple t will become $h(R(t))$ after applying $h_{Rel}$, and t(u) would become $\sum h(k_i) \otimes m_i$. Thus after applying the selection, we would get the same result, $h(R(t)) \cdot_{K'} [(\sum h(k_i) \otimes m_i) = 1_{K'} \otimes m]$.



Aggregation. In case (I), we first apply the GB operation $GB_{U',u} R$ and obtain a relation such that for every tuple t in its support it holds that:

1. t is defined (only) on the attributes in $U' \cup \{u\}$.

2. There exists some tuple $t' \in supp(R)$ such that for every attribute $u' \in U'$, $t(u') = t'(u')$.

3. $t(u) = \sum_{t' \in supp(R)} \left(R(t') \cdot_K \prod_{u' \in U'} [t(u') = t'(u')]\right) *_{K \otimes M} t'(u)$

4. $GB_{U',u} R(t) = \delta\left(\sum_{t' \in supp(R)} R(t') \cdot_K \prod_{u' \in U'} [t(u') = t'(u')]\right)$.

Now, after applying $h_{Rel}$ on the result, the effect on such tuple t would be:

1. $t(u) = \sum_{t' \in supp(R)} \left(h^M(R(t')) \cdot \prod_{u' \in U'} [h^M(t(u')) = h^M(t'(u'))]\right) *_{K' \otimes M} h^M(t'(u))$

2. $h_{Rel}(GB_{U',u} R)(t) = \delta\left(\sum_{t' \in supp(R)} h^M(R(t')) \cdot \prod_{u' \in U'} [h^M(t(u')) = h^M(t'(u'))]\right)$.

In case (II), we first apply $h_{Rel}$, which affects the tuple provenances and the values from $K^M \otimes M$. Then aggregation is applied on the result. Each tuple t in $supp(GB_{U',u} h_{Rel}(R))$ is such that:

1. t is defined (only) on the attributes in $U' \cup \{u\}$.

2. There exists some tuple $t' \in supp(h_{Rel}(R))$ such that for every attribute $u' \in U'$, $t(u') = t'(u')$.

3. $t(u) = \sum_{t' \in supp(R)} \left(h^M(R(t')) \cdot \prod_{u' \in U'} [t(u') = h^M(t'(u'))]\right) *_{K' \otimes M} h^M(t'(u))$

4. $GB_{U',u} h_{Rel}(R)(t) = \delta\left(\sum_{t' \in supp(R)} h^M(R(t')) \cdot \prod_{u' \in U'} [t(u') = h^M(t'(u'))]\right)$.

We now verify that the results in both cases are indeed equal. In the first case, according to the definition, every tuple t in supp(R) forms the basis of a group, which conditionally may contain every tuple in R (using equation expressions to verify that each tuple is indeed in that group only if its restriction to U' is equal to the restriction of t to U'). When we apply h, some tuple annotations may be set to 0, and thus their corresponding summands (in the group provenances and aggregation results) are set to 0, and do not affect the result. In case (II) some tuples may be removed by h from the relation even before the aggregation is performed. This has a similar effect to setting their corresponding summands to 0 as in case (I). There is a slight difference here: if supp(R) was of size n, so will be the size of the support of $GB_{U',u} R$, maybe even after applying $h_{Rel}$ on it; however, $supp(h_{Rel}(R))$ may be of size $m < n$, and thus so will be $GB_{U',u} h_{Rel}(R)$. The reason is that as long as we cannot evaluate equation values, we have to allow for n different groups (as many as there are in the support), thus if there are "actually" fewer, when we move to a $K'^M \otimes M$ where equations may be evaluated, we may get the same tuple representing the group duplicated as many times as the number of its group members. This is acceptable, since duplicates are ignored. However if we apply h first, we may know that there cannot be n groups to begin with, since some of the tuples are deleted. The effect of this will only be fewer group duplicates. The commutation of the AGG operation with homomorphisms follows from the above proof as well.

This concludes the proof. □ 
Proof. (Proposition 5.1)

First, note that by the homomorphism commutation, we can check the equivalence of using both expressions on $h^{\mathbb{B}}(R)$ and $h^{\mathbb{B}}(S)$, i.e. verify that

$$\left(\Pi_{a_1 \ldots a_n}\left\{\left(GB_{\{a_1, \ldots, a_n\}, b}\left(h^{\mathbb{B}}(R) \times \bot_b \cup h^{\mathbb{B}}(S) \times \top_b\right)\right) \bowtie \left(h^{\mathbb{B}}(R) \times \bot_b\right)\right\}\right)(t) = [h^{\mathbb{B}}(S)(t) \otimes \top = 0] \cdot_{K'} h^{\mathbb{B}}(R)(t).$$

Note that since $K' \otimes \mathbb{B}$ is isomorphic via an isomorphism I to $\mathbb{B}$, instead of checking equality in $K' \otimes \mathbb{B}$ we can check for equality after application of the isomorphism (in $\mathbb{B}$); in particular this allows us to interpret the equalities and replace them with $1_{K'}$ or $0_{K'}$.

Now let us follow the operation of difference encoded by aggregation step-by-step. Let $R, S : \mathbb{D}^U \to K^{\mathbb{B}}$; let $supp(h^{\mathbb{B}}(R)) = \{r_1, \ldots, r_n\}$ and $supp(h^{\mathbb{B}}(S)) = \{s_1, \ldots, s_m\}$. Then $supp(h^{\mathbb{B}}(R) \times \bot_b) = \{r'_i \mid r'_i(b) = \bot \wedge \exists r_i \in supp(h^{\mathbb{B}}(R)) \; \forall u \in U \; r'_i(u) = r_i(u)\}$; since $\times$ is equivalent to join with no attribute equations, the provenance $(h^{\mathbb{B}}(R) \times \bot_b)(t)$ is $h^{\mathbb{B}}(R)(t) \cdot_{K'} \bot_b(t) = h^{\mathbb{B}}(R)(t)$ for any $t \in supp(h^{\mathbb{B}}(R) \times \bot_b)$, and $0_{K'}$ otherwise; similarly for $h^{\mathbb{B}}(S) \times \top_b$. Now, note that the support sets of both relations are mutually exclusive, and that all the values in the tuples in the support of both $h^{\mathbb{B}}(R) \times \bot_b$ and $h^{\mathbb{B}}(S) \times \top_b$ are from $\mathbb{D}$. Thus, it is easy to see that $(h^{\mathbb{B}}(R) \times \bot_b \cup h^{\mathbb{B}}(S) \times \top_b)(t)$ is $h^{\mathbb{B}}(R)(t|_U)$ for $t \in supp(h^{\mathbb{B}}(R) \times \bot_b)$, $h^{\mathbb{B}}(S)(t|_U)$ for $t \in supp(h^{\mathbb{B}}(S) \times \top_b)$, and $0_{K'}$ otherwise. Now we apply group-by on the b attribute. There are four possible classes of tuples:

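1. For every $r'_i$ such that $r_i = s_j \in h^{\mathbb{B}}(R) \cap h^{\mathbb{B}}(S)$, we would get a tuple $r''_i$ in the support of the result, such that $r''_i|_U = r'_i|_U = r_i = s_j = s'_j|_U$, and $r''_i(b) = (h^{\mathbb{B}}(R) \times \bot_b)(r'_i) \otimes \bot +_{K' \otimes \mathbb{B}} (h^{\mathbb{B}}(S) \times \top_b)(s'_j) \otimes \top = h^{\mathbb{B}}(S)(r_i) \otimes \top$. The provenance of this tuple would be $\delta(h^{\mathbb{B}}(R)(r_i) +_{K'} h^{\mathbb{B}}(S)(r_i))$.

2. For every $r'_i$ such that $r_i \in h^{\mathbb{B}}(R) \setminus h^{\mathbb{B}}(S)$, there is a tuple $r''_i$ in the group-by result such that $r''_i(b) = (h^{\mathbb{B}}(R) \times \bot_b)(r'_i) \otimes \bot = 0_{K' \otimes \mathbb{B}}$, and where the provenance of $r''_i$ is $\delta(h^{\mathbb{B}}(R)(r_i))$.

3. For every $s'_i$ such that $s_i \in h^{\mathbb{B}}(S) \setminus h^{\mathbb{B}}(R)$, there is a tuple $s''_i$ in the group-by result such that $s''_i(b) = (h^{\mathbb{B}}(S) \times \top_b)(s'_i) \otimes \top = h^{\mathbb{B}}(S)(s_i) \otimes \top$, and where the provenance of $s''_i$ is $\delta(h^{\mathbb{B}}(S)(s_i))$.

4. Every other tuple has provenance $0_{K'}$, i.e. it is not in the support of the aggregation result.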

Now we need to perform a join of the group-by result and $h^{\mathbb{B}}(R) \times \bot_b$. For that we will first rename the attributes of the aggregation result to $U_2 \cup \{b'\}$. For each tuple t in the aggregation result such that $t|_U \in h^{\mathbb{B}}(R)$, i.e. tuples of cases (1) and (2) above, there will be a unique corresponding tuple in the join result (since all these tuples have unique U values, they will join with a unique tuple in $h^{\mathbb{B}}(R) \times \bot_b$). The obtained provenance of the join of tuples from case (1) would be

$$\delta(h^{\mathbb{B}}(R)(r_i) + h^{\mathbb{B}}(S)(r_i)) \cdot h^{\mathbb{B}}(R)(r_i) \cdot [h^{\mathbb{B}}(S)(r_i) \otimes \top = 0].$$

However, $\delta(h^{\mathbb{B}}(R)(r_i) + h^{\mathbb{B}}(S)(r_i))$ is redundant here: if $h^{\mathbb{B}}(S)(r_i) \neq 0_{K'}$ or $h^{\mathbb{B}}(R)(r_i) = 0_{K'}$, the provenance is $0_{K'}$; otherwise, $\delta(h^{\mathbb{B}}(R)(r_i) + h^{\mathbb{B}}(S)(r_i)) = \delta(h^{\mathbb{B}}(R)(r_i)) = 1_{K'}$. As for the join of tuples from case (2), the case is simpler: the provenance is $\delta(h^{\mathbb{B}}(R)(r_i)) \cdot h^{\mathbb{B}}(R)(r_i) \cdot [0 = 0] = \delta(h^{\mathbb{B}}(R)(r_i)) \cdot h^{\mathbb{B}}(R)(r_i)$, and again $\delta(h^{\mathbb{B}}(R)(r_i))$ has no effect. For cases (3) and (4) the provenance is $0_{K'}$. This matches the result of $[(h^{\mathbb{B}}(S))(t) \otimes \top = 0] \cdot_{K'} (h^{\mathbb{B}}(R))(t)$. □



